Google Anthos in Action (MEAP V13)

Learn Anthos directly from the Google development team! Anthos delivers a consistent management platform for deploying an…


English | 836 pages | 2023



Table of contents:
welcome
1_Overview_of_Anthos
2_One_single_pane-of-glass
3_Computing_environment_built_on_Kubernetes
4_Anthos_Service_Mesh:_Security_and_observability_at_scale
5_Operations_management
6_Bringing_it_all_together
7_Hybrid_applications
8_Working_at_the_edge_and_telco_world
9_Serverless_compute_engine_(Knative)
10_Networking_environment
11_Config_management_architecture
12_Integrations_with_CI/CD
13_Security_and_policies
14_Marketplace
15_Migrate
16_Breaking_the_monolith
17_Compute_environment_running_on_Bare_Metal
Appendix_A_Cloud_is_a_new_computing_stack
Appendix_B_Lessons_from_the_field
Appendix_C_Compute_environment_running_on_VMware
Appendix_D_Data_and_analytics
Appendix_E_An_end-to-end_example_of_ML_application
Appendix_F_Compute_environment_running_on_Windows


welcome

Thank you for purchasing the MEAP edition of Google Anthos in Action. I hope that what you'll get access to will be of immediate use to you for developing cloud software and, with your help, the final book will be amazing! To get the most benefit from this book, you need to have some understanding of cloud computing. Don't worry too much, because we will review basic concepts such as microservices and service meshes in the initial chapters of the book.

Whether you are a software developer, a businessperson interested in technology, a technology researcher, or simply a person who wants to know more about where the world of computers and people is going, it is essential to know more about Anthos. This book is an excellent way to get an understanding not just of a critical technology, but of the trends and thinking that led us to create Anthos. As such, it will also give you a good sense of the future.

When we started to look at Anthos in Google, we were inspired by one single principle: let's abstract the less-productive majority of the effort in running computing infrastructure operations and remove habitual constraints that are not really core to building and running software services. Kubernetes and the SRE principles for running managed services with DevOps were fundamental concepts used to build the foundations for our new product. Anthos is the next major step towards making developers' work easier, thanks to a customer-focused, run-anywhere distributed development platform that can automate utilization, reduce expensive dependencies, and smooth the path to production.

This is a big book, and I hope you find it as useful to read as the authors, curators, and reviewers found it interesting to produce. My suggestion is that you dive deep into the first five or six chapters and then pick the follow-up topics in the order you like.

Please make sure to post any questions, comments, or suggestions you have about the book in the liveBook discussion forum. Your feedback is essential in developing the best book possible.

—Antonio Gulli

Copyright 2023 Manning Publications

brief contents
welcome
1 Overview of Anthos
2 One single pane-of-glass
3 Computing environment built on Kubernetes
4 Anthos Service Mesh: Security and observability at scale
5 Operations management
6 Bringing it all together
7 Hybrid applications
8 Working at the edge and telco world
9 Serverless compute engine (Knative)
10 Networking environment
11 Config management architecture
12 Integrations with CI/CD
13 Security and policies
14 Marketplace
15 Migrate
16 Breaking the monolith
17 Compute environment running on Bare Metal
Appendix A Cloud is a new computing stack
Appendix B Lessons from the field
Appendix C Compute environment running on VMware
Appendix D Data and analytics
Appendix E An end-to-end example of ML application
Appendix F Compute environment running on Windows

1 Overview of Anthos

This chapter includes:
Anatomy of a Modern Application
Accelerating Software Development with Anthos
Standardizing Operations At-Scale with Anthos
Origins at Google
How to read this book

Software has been eating the world for a while. As consumers, we are used to applications that make it faster, smarter, and more efficient for us to do things like calling a cab or depositing a paycheck. Increasingly, our health, education, entertainment, social life, and employment are all enhanced by modern software applications. At the other end of those applications is a chain of enterprises, large and small, that deliver these improved experiences, services, and products. Modern applications are deployed not just in the hands of consumers but also at points along this enterprise supply chain. Major transactional systems in many traditional industries such as retail, media, financial services, education, and logistics are gradually being replaced by modern microservices that auto-update frequently, scale efficiently, and incorporate more real-time intelligence. New digital-first startups are using this opportunity to disrupt traditional business models, while enterprise incumbents are rushing to modernize their systems so they can compete and avoid disruption. This book will take you through the anatomy of Anthos: the platform, the development environment, the elements of automation and scaling, and the connection to patterns adapted from Google to attain excellence in modern software development in any industry. Each chapter includes practical examples of how to use the platform, and several include hands-on exercises to implement the techniques.

1.1 Anatomy of a Modern Application

What is a Modern Application? When you think of software that has improved your life, perhaps you think of applications that are interactive, fast (low latency), connected, intelligent, context aware, reliable, secure, and easy to use on any device. As technology advances, the capabilities of modern applications, such as the level of security, reliability, awareness, and intelligence, advance as well. For example, new development frameworks such as React and Angular have greatly enhanced the level of interactivity of applications, and new runtimes like Node.js have increased functionality. Modern applications have the property of constantly getting better through frequent updates. On the backend these applications often comprise a number of services that are all continuously improving. This modularity is attained by breaking the older ‘monolith’ pattern for writing applications, where all the various functions were tightly coupled to each other. Applications written as a set of modular services, or microservices, have several benefits: constituent services can be evolved independently or replaced with other more scalable or otherwise superior services over time. The microservices pattern is also better at separating concerns and setting contracts between services, making it easier to inspect and fix issues. This approach to writing, updating, and deploying applications as ‘microservices’ that can be used together but also updated, scaled, and debugged independently is at the heart of modern software development. In this book we refer to this pattern as “Modern” or “Cloud Native” application development. The term Cloud Native applies here because the microservices pattern is well suited to running on distributed infrastructure, or Cloud. Microservices can be rolled out incrementally, scaled, revised, replaced, scheduled, rescheduled, and bin-packed tightly on distributed servers, creating a highly efficient, scalable, reliable system that is responsive and frequently updated. Modern applications can be written ‘greenfield’, from scratch, or re-factored from existing ‘brownfield’ applications by following a set of architectural and operational principles. The end goal of Application Modernization is typically revenue acceleration, and often this involves teams outside IT, in Line-of-Business units. IT organizations at most traditional enterprises have historically focused on reducing costs and optimizing operations; while cost reduction and optimized operations can be by-products of application

modernization, they are not the most important benefit. Of course, the modernization process itself requires up-front investment. Anthos is Google Cloud’s platform for application modernization in hybrid and multi-cloud environments. It provides the approach and technical foundation needed to attain high-ROI application modernization. An IT strategy that emphasizes modularity through APIs, microservices, and cloud portability, combined with a developer platform that automates reuse, experiments, and cost-efficient scaling along with secure, reliable operations, are the basic critical prerequisites for successful Application Modernization. The first part of Anthos is a modern developer experience that accelerates Line-of-Business application development. It is optimized for re-factoring brownfield apps and writing microservices and API-based applications. It offers unified local, on-prem, and cloud development with event-driven automation from source to production. It gives developers the ability to write code rapidly using modern languages and frameworks with local emulation and test, integrated CI/CD, and support for rapid iteration, experimentation, and advanced rollout strategies. The Anthos developer experience emphasizes cloud APIs, containers, and functions but is customizable by enterprise platform teams. A key goal of the Anthos developer experience is for teams to release code multiple times a day, thereby enhancing both velocity and reliability. Anthos features built-in velocity and ROI metrics to help development teams measure and optimize their performance. Data-driven benchmarks are augmented with pre-packaged best-practice blueprints that can be deployed by teams to achieve the next level of performance. The second part of Anthos is an operator experience for central IT. Anthos shines as a uniquely scalable, streamlined way to run operations across multiple clouds. This is enabled by the remarkable foundation of technology invented and honed at Google over a period of twenty years for running services with extremely high reliability on relatively low-cost infrastructure. This is achieved through standardization of the infrastructure using a layer of abstraction comprising Kubernetes, Istio, Knative, and several other building blocks, along with Anthos-specific extensions and integrations for automated configuration, security, and operations. The operator experience on Anthos offers advanced security and policy controls, automated declarative configuration, highly scalable service visualization and operations, and

automated resource and cost management. It features extensive automation, measurement, and fault-avoidance capabilities for high availability and secure service management across cloud, on-prem, edge, virtualized, and bare metal infrastructure. Enterprises and small companies alike find that multi-cloud and edge are their new reality, either organically or through acquisitions. Regulations in many countries require proven ability to migrate applications between clouds and to demonstrate failure tolerance with support for sovereignty. Non-regulated companies find multi-cloud necessary for providing developer choice and access to innovative services. Opportunities for running services and providing greater intelligence at the edge add further surfaces to the infrastructure footprint. Some IT organizations roll their own cross-cloud platform integrations, but this job gets harder every day. It is extremely difficult to build a cross-cloud platform in a scalable, maintainable way, but more importantly, it detracts from precious developer time for product innovation. Anthos provides a solution rooted in years of time-tested experience and technical innovation at Google in software development and SRE operations, augmented with Google Cloud’s experience managing infrastructure for modern applications across millions of enterprise customers. Anthos is unique in serving the needs of LoB developers and central IT together, with advanced capabilities in both domains. Consistency of developer and operator experience across environments enables enterprises to obtain maximum ROI from Application Modernization with Anthos.

1.1.1 Accelerating Software Development
Software product innovation and new customer experiences are the engine of new revenue generation in the digital economy. But the innovation process is such that only a few ideas lead to successful new products; most fail and disappear. As every industry transitions to being software driven, new product innovation becomes dependent on having a highly agile and productive software development process. Developers are the new Kingmakers. Without an agile, efficient development process and platform, companies can fail to innovate or innovate at very high costs and even

negative ROI. An extensive DevOps Research and Assessment (DORA)[1] study surveyed over 30,000 IT professionals over several years across a variety of IT functions. It has shown that excellence in software development is a hallmark of business success. This is not surprising given the importance of modern applications in fueling the economy. DORA quantifies these benefits, showing that “Elite,” or the highest-performing, software teams are 2X more effective in attaining revenue and business goals than low-performing teams. The distinguishing characteristic of Elite teams is the practice of releasing software frequently. DORA finds that four key metrics provide an accurate measurement of software development excellence. These are:
Deployment frequency
Lead time for changes
Change fail rate
Time to restore service
High-performing teams release software frequently, for example several times a day. In comparison, low performers release less than once a month. The study also found that teams that release frequently have a lower software defect ratio and recover from errors more rapidly than others. As a result, in addition to being more innovative and modern, their software is more reliable and secure. Year-over-year DORA results also show that an increasing number of enterprises are investing in the tools and practices that enable Elite performance. Why do teams with higher development velocity have better business results? In general, higher velocity means that developers are able to experiment more and test more, so they come up with a better answer in the same amount of time. But there is another reason. Teams with higher velocity have usually made writing and deploying code an automated, low-effort process. This has the side effect of enabling more people to become developers, especially those who are more entrenched in the business than in the tooling. As a result, high-velocity developer teams have more Line-of-Business thinking and a greater understanding of end-user needs on the development team. The combination of rapid experimentation and focus on users yields better business results. Anthos is the common substrate layer that

runs across clouds to provide a common developer experience for accelerating application delivery.

1.1.2 Standardizing Operations At-Scale
Developers may be the new Kingmakers, but Operations is the team that runs the kingdom day in and day out. Operations includes the teams that provision, upgrade, manage, troubleshoot, and scale all aspects of services, infrastructure, and cloud. Typically networking, compute, storage, security, identity, asset management, billing, and reliability engineering are part of the operations team of an enterprise. Traditional IT teams have anywhere from 15-30% of their staff in IT operations. This team is not always visibly engaged in new product introductions with the line-of-business, but it often lays the groundwork, selecting clouds, publishing service catalogs, and qualifying services for use by the business. Failing to invest in operations automation often means that operations become the bottleneck and a source of fixed cost. On the flip side, modernizing operations has a tremendous positive effect on velocity. Modern application development teams are typically supported by a very lean operations team, where 80%+ of staff is employed in software development vs. operations. Such a developer-centric ratio is only achieved through modern infrastructure with scaled, automated operations. This means operations are extremely streamlined and leverage extensive automation to bring new services online quickly. Perhaps the greatest value of Anthos is in automating operations at scale consistently across environments. The scale and consistency of Anthos are enabled by a unique open cloud approach that has its origins in Google’s own infrastructure underpinning.

1.2 Origins in Google
Google’s software development process has been optimized and fine-tuned over many years to maximize developer productivity and innovation. This attracts the best software developers in the world and leads to a virtuous cycle of innovation in software and in software development and delivery practices. The Anthos development stack has evolved from these foundations and is built on core open source technology that Google introduced to the industry.

At the heart of Anthos is Kubernetes, the extensive orchestration and automation model for managing infrastructure through the container abstraction layer. The layer above Kubernetes is grounded in Google’s Site Reliability Engineering or Operations practices, which standardize the control, security and management of services at scale. This layer of Service Management is rooted in Google’s Istio-based Cloud Service Mesh. Enterprise policy and configuration automation is built in at this layer using Anthos Configuration Management to provide automation and security at scale. This platform can run on multiple clouds and abstracts the disparate networking, storage and compute layers underneath (see Figure 1.1). Figure 1.1 Anthos components and functions

Above this Anthos stack are the developer experience and DevOps tooling, including a deployment environment that uses Knative and integrated CI/CD with Tekton.

1.3 Summary

Modern software applications provide a host of business benefits and are driving transformation in many industries.
The backend for these applications is typically based on the cloud-native, microservices architectural pattern, which allows for great scalability, modularity, and a host of operational and DevOps benefits, and is well suited to running on distributed infrastructure.
Anthos, which originated in Google Cloud, is a platform for hosting cloud-native applications, providing both developer and operational benefits.

[1]

https://www.devops-research.com/research.html

2 One single pane-of-glass

This chapter covers:
The advantages of having a Single Pane of Glass and its components
How different personas can use and benefit from these components
Getting some hands-on experience configuring the UI and attaching a cluster to the Anthos UI

We live in a world where application performance is critical for success. To better serve their end users, many organizations have pushed to distribute their workloads out of centralized data centers. Whether to be closer to their users, to enhance disaster recovery, or to leverage the benefits of cloud computing, this distribution has placed additional pressure on the tooling used to manage and support this strategy. The tools that have flourished under this new paradigm are those that have matured and become more sophisticated and scalable. There is no one-size-fits-all tool. Likewise, there is no one person who can manage the infrastructure of even a small organization. All applications require tools to manage CI/CD, monitoring, logging, orchestration, deployments, storage, authentication/authorization, and more. In addition to the scalability and sophistication mentioned above, most of the tools in this space offer an informative and user-friendly graphical user interface (GUI). Having an easily understood GUI can help people use the tool more effectively, since it lowers the bar for learning the software and increases the amount of pertinent output the user receives. Anthos itself has the capacity to support hundreds of applications and thousands of services, so a high-quality GUI and a consolidated user experience are required to leverage the ecosystem to its full potential and reduce the operational overhead. To this end, Google Cloud Platform has developed a rich set of dashboards and integrated tools within the Google Cloud Console to help you monitor, troubleshoot, and interact with your deployed Anthos clusters, regardless of their location or infrastructure

provider. This single pane of glass allows administrators, operations professionals, developers, and business owners to view the status of their clusters and application workloads, all while leveraging the capabilities of Google Cloud’s Identity and Access Management (IAM) framework and any additional security provided on each cluster. The Anthos GUI, its "Single Pane of Glass", is not the first product to attempt to centralize the visibility and operations of a fleet of clusters, but it is the one that offers real-time visibility across a large variety of environments. In order to fully understand the benefits of the Anthos GUI, in this chapter we are going to take a look at some of the other options available to aggregate and standardize interactions with multiple Kubernetes clusters.

2.1 Single Pane of Glass
A single pane of glass rests on three main pillars that are shared across all operators, industries, and operational scales. These three pillars are:

Centralization: As the name suggests, a single pane of glass should provide a central UI for resources, no matter where they run and to whom they are provided. The former aspect relates to which infrastructure and cloud provider the clusters are operating on, and the latter relates to inherently multi-tenant services where one operator centrally manages multiple clients’ clusters and workloads. With the benefits of a central dashboard, admins are able to get a high-level view of resources and drill down to areas of interest without switching views. However, a central environment might cause some concern in areas of privacy and security. Not every administrator needs to connect to all clusters, and not all admins should have access to the UI. A central environment should come with its own safeguards to avoid any operational compromise with industry standards.

Consistency: Let’s go back to the scenario of an operator running clusters and customers in multi-cloud or hybrid architectures. A majority of infrastructure providers, whether they offer proprietary services or run on

open source, attempt to offer a solid interface for their users. However, they use different terminology and have inconsistent views on priorities. Finally, depending on their UI philosophy and approach, they design the view and navigation differently. Remember, for a cloud provider, cluster and container management is only one part of a bigger suite of services and one member of a pre-designed dashboard. While this might be a positive element in single-provider environments (you can navigate from the Kubernetes dashboard into the rest of the cloud services dashboards with minimal switching), it becomes a headache for multi-environment services and for those who mainly focus on Kubernetes.

Ease of use: Part of the appeal of a single pane of glass is how data coming from different sources is aggregated, normalized, and visualized. This brings a lot of simplicity to drilling down into performance management and triage, especially when combined with a graphical interface. A graphical UI has always been an important part of any online application. First, at some point in an app management cycle, there is a team that has neither the skills nor the interest to work with remote API calls. This team expects a robust, easy-to-navigate, and highly device-agnostic UI for its day-to-day responsibilities. Second, regardless of skill set, an aggregated dashboard has much more to offer in one concentrated view than calling service providers and perhaps clusters individually, provided that the UI presents the right data fields with good readability.

2.2 Non-Anthos Visibility and Interaction Anthos is not the first solution to expose information about a Kubernetes cluster through a more easily digested form than the built-in APIs. While many developers and operators have used the command-line interface (CLI), kubectl, to interact with a cluster, the information presented can be very technical and does not usually help surface potential issues in a friendly way. Extensions to Kubernetes such as Istio or Anthos Configuration Management typically come with their own CLIs as well (istioctl and nomos, for example). Cross-referencing information between all the disparate tools can be a

substantial exercise, even for the most experienced developer or operator.

Kubernetes Dashboard
One of the first tools developed to solve this particular issue was the Kubernetes Dashboard[2]. While this utility is not deployed by default on new Kubernetes clusters, it is trivial to deploy to the cluster and begin utilizing the information it provides. In addition to providing a holistic view of most of the components of a Kubernetes cluster, the Dashboard also provides users with a GUI to deploy new workloads into the cluster. This makes the Dashboard a convenient and quick way to view the status of, and interact with, a new cluster. However, it only works on one cluster. While you can certainly deploy the Kubernetes Dashboard to each of your clusters, they will remain independent of each other and have no cross-connection. In addition, since the Dashboard is located on the cluster itself, accessing it remotely requires a similar level of effort to using the CLI tool, requiring services, load balancing, and ingress rules to properly route and validate incoming traffic. While the Dashboard can be powerful for proof-of-concept or small developer clusters, multi-user clusters need a more powerful tool.

Provider-specific UIs
Kubernetes was released from the beginning as an open-source project. While based on internal Google tools, the structure of Kubernetes allowed vendors and other cloud providers to easily create their own customized versions of Kubernetes, either to simplify deployment or management on their particular platform(s) or to add additional features. Many of these adaptations have customized UIs for either deployment or management operations. For cloud providers in particular, much of the existing user interface for their other products already existed and followed a particular style. Each provider developed a different UI for their particular version of Kubernetes. While a portion of these UIs dealt with provisioning and maintaining the infrastructure of a cluster, some of the UI was dedicated to cluster operations and manipulation. However, each UI was implemented differently, and

cannot manage clusters other than the native Kubernetes flavor for that particular cloud provider.

Bespoke Software
Some companies have decided to push the boundaries themselves and develop their own custom software and UIs to visualize and manage their Kubernetes installations and operations. While this is always an option thanks to the open standards of the Kubernetes APIs, any bespoke development brings all the challenges that come with maintaining custom operations software: keeping up with new versions, bug fixing, handling OS and package upgrades, and so on. For the highest degree of customization, nothing beats bespoke software, but the cost vs. benefit calculation does not work out for most companies.
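To put the earlier point about the Kubernetes Dashboard in concrete terms: deploying it to a single cluster takes only a couple of commands, and the friction shows up in accessing it remotely, one tunnel per cluster. The following is an illustrative sketch only; the manifest URL is pinned to a specific Dashboard release and may differ for your version:

# Deploy the open-source Kubernetes Dashboard to one cluster (release pinned for illustration)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml
# Reach it from a workstation through an API-server proxy; each cluster needs its own session
kubectl proxy
# then browse to:
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/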

2.3 The Anthos UI
Each of the previous solutions has a fundamental flaw that prevents most companies from fully leveraging it. The Kubernetes Dashboard has no multi-cluster capability and does not handle remote access easily. The provider-specific UIs work well for their flavor but cannot handle clusters that are not on their network or running their version of Kubernetes. And bespoke software comes with a high cost of development and maintenance. This is where the Anthos multi-cluster single pane of glass comes into play. This single pane of glass is an extension of, and embedded in, Google Cloud Platform’s already extensive Cloud Console, which allows users to view, monitor, and manage their entire cloud infrastructure and workloads. The solution Google has developed for multi-cluster visibility in Anthos depends on a new concept called Fleets (formerly referred to as Environs), the Connect framework, and the Anthos dashboard. The Anthos dashboard is an enhancement of the existing GKE dashboard that Google has provided for several years for its in-cloud GKE clusters. The Connect framework is new with Anthos and simplifies the communication process between Google Cloud and clusters located anywhere in the world. Fleets are methods of aggregating clusters together in order to simplify common work between

them. Let’s take a moment to discuss a bit more about fleets. Fleets Fleets are a Google Cloud concept for logically organizing clusters and other resources, letting you use and manage multi-cluster capabilities and apply consistent policies across your systems. Think of them as a grouping mechanism that applies several security and operation boundaries to resources within a single project[3]. They help administrators build a one-tomany relationship between a Fleet and its member clusters and resources to reduce the configuration burden of individual security and access rules. The clusters in a Fleet also exist in a higher trust relationship with each other by belonging to the same Fleet. This means that it is easier to manage traffic into and between the clusters and join their service meshes together. An Anthos cluster will belong to one and only one Fleet and cannot join another without leaving the first. Unfortunately, this can present a small problem in complex service communications. For example, assume we have an API service and a Data Processing service that need to run in distinct fleets for security reasons, but both need to talk to a bespoke Permissioning service. The Permissioning service can be placed in one of the two fleets, but whichever service does not belong to Permissioning's Fleet will need to talk to the service using outside-the-cluster networking. However, this rule for fleets prevents users from accidentally merging clusters that must remain separate, as allowing the common service to exist in both fleets simultaneously would open up additional attack vectors (see Figure 2.1). Figure 2.1 Example of fleet merging causing security issues.

When multiple clusters are in the same Fleet, many types of resources must have unique names, or they will be treated as the same resource. This obviously includes the clusters themselves, but also namespaces, services, and identities. Anthos refers to this as “sameness”. Sameness forces consistent ownership across all clusters within a Fleet, and namespaces that are defined on one cluster, but not on another, will be reserved implicitly. When designing the architecture of your services, this "sameness" concept must be kept in mind. Anthos Service Mesh, for example, will typically treat a service that exists in the same namespace with the same name as an identical service across the entire Fleet and will load balance traffic between clusters automatically. If the namespace and/or service in question have a fairly unique name, this should not cause any confusion. However, accessing

the "webserver" service in the "demo" namespace might yield unexpected results. Finally, Anthos allows all services to utilize a common identity when accessing external resources such as Google Cloud services, object stores, and so on. This common identity makes it possible to give the services within a fleet access to an external resource once, rather than cluster-by-cluster. While this can be overridden and multiple identities defined, if resources are not architected carefully and configured properly, negative outcomes can occur. Connect, How does it work? Now that we have discussed fleets, we need to examine how the individual clusters communicate with Google Cloud. Any cluster that is part of Anthos, whether Attached[4] or Anthos-managed, has Connect deployed to the cluster as part of the installation or registration process. This deployment establishes a persistent connection from the cluster outbound to Google Cloud that accepts traffic from the cloud to permit cloud-side operations secure access to the cluster. Since the initial connection is outbound, it does not rely on a fully routable connection from the cloud to the cluster. This greatly reduces the security considerations and does not require the cluster to be discoverable on the public internet. Once the persistent connection is established, Anthos can proxy requests made by Google Cloud services or users using the Google Cloud UI to the cluster whether it is located within Google Cloud, in another cloud provider, at the edge, or on-premise. These requests use the user’s or the service’s credentials, maintaining the security on the cluster and allowing the existing role-based access controls (RBAC)[5] rules to span both direct connectivity as well as connections through the proxy. A request using the Anthos UI may look like Figure 2.2: Figure 2.2 Flow of request and response from Google Cloud to Cluster and back

While the tunnel from the Connect agent to Google Cloud is persistent, each stage of each request is authenticated using various mechanisms to validate the identity of the requestor and that that particular layer is allowed to make the request. Skipping layers is not permitted and will be rejected by the next layer receiving the invalid request. An overview of the request-response authentication is seen in Figure 2.3: Figure 2.3 Request validation steps from Google Cloud to Cluster

Regardless of any authorization measures at the cluster level, a user must still be allowed to view the Google Cloud project the cluster is attached to in order to use the Connect functionality. This uses the standard IAM processes for a given project, but having the separate permission allows the Security team to grant a user access to a cluster through a direct connection (or some other tunnel) while not allowing them remote access via Google Cloud. Connect is compliant with Google’s Access Transparency[6], which provides transparency to the customer in two main areas:

Access approval: Customers can authorize Google support staff to work on certain parts of their services. Customers can view the reasons why a Google employee might need that access.

Activity visibility: Customers can import access logs into their project’s Cloud Logging to have visibility into Google employees’ actions and

location and can query the logs in real time, if necessary.

Installation and Registration
In order to leverage the Connect functionality, we obviously need to install the Connect agent on our cluster. We also need to inform Google about our cluster and determine which project, and therefore which Fleet, the cluster belongs to. Fortunately, Google has provided a streamlined utility for performing this task via the gcloud command-line tool[7]. This process will utilize either Workload Identity or a Google Cloud service account to enroll the cluster with the project’s Connect pool and to install and start the Connect agent on the cluster. While these steps enroll the cluster with Google and enable most Anthos features, you still need to authenticate with the cluster from the Google Console in order to view and interact with the cluster from Google Cloud. Connect allows authentication via Cloud Identity (when using the Connect Gateway)[8], bearer token, or OIDC, if enabled on the cluster. The easiest, and recommended, method is to use Cloud Identity, but this requires the activation and configuration of the Connect Gateway for the cluster. For more information on Connect Gateway, please see the chapter on Operations Management with Anthos.
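As a quick illustration of the registration flow (a sketch; exact commands and flags vary by gcloud release), once a cluster has been registered you can list the fleet memberships in a project and, with the Connect Gateway enabled, fetch credentials that route kubectl through Google Cloud:

# List the clusters registered to the project's fleet
gcloud container fleet memberships list --project=[PROJECT_ID]
# With the Connect Gateway enabled, obtain a kubeconfig entry that proxies through Google Cloud
gcloud container fleet memberships get-credentials [MEMBERSHIP_NAME] --project=[PROJECT_ID]
kubectl get namespaces   # this request now travels through the Connect tunnel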

2.4 The Anthos Cloud UI
Now that we’ve done the plumbing, we can actually walk through and show off the UI. Google provides the Anthos UI via the Cloud Console, at the project level. Since the Anthos UI is only visible at the project level, only clusters registered to that project’s Fleets are visible. The Anthos UI menu contains multiple sub-pages, each providing a focus on a distinct aspect of cluster management. At the time of writing, these sections are the Dashboard, Service Mesh, Config Management, Clusters, Features, Migrate to containers, Security, Cloud Run for Anthos, and Virtual Machines. Let’s take a look at each of these pages:

The Anthos Dashboard

The default page for the Anthos menu, and the central hub for the UI, is the Dashboard. The Dashboard is intended to give admins a wide-angle view of the clusters in the current Fleet, while making it easy to drill down into details for specific components. To start, go to the hamburger menu in the top-left corner of the console (Figure 2.4). Find “Anthos” in the menu and click on it to enter the all-Anthos features page. Figure 2.4 shows how to navigate to the Anthos dashboard.

Figure 2.4 Navigation to Anthos dashboard

Figure 2.5 shows an example of the Anthos Dashboard view. Figure 2.5 Example of an Anthos Dashboard

While this example shows the current Anthos project cost, the Dashboard still leverages Google’s IAM, and that information will only appear if the viewing user has the appropriate billing-related permissions. The remaining sections highlight critical errors or other user-involved issues for that particular aspect of Anthos. Following those links takes you to the appropriate sub-page below.

Service Mesh
The Service Mesh page shows all services registered in any of the clusters in the current Fleet. The initial list shows the name, namespace, and cluster of each service, as well as basic metrics such as error rate and latency at predefined thresholds. This list can also be filtered by namespace, cluster name, requests per second, error rate, latency, request size, and resource usage to allow admins to easily drill down for specific tasks. Figure 2.6 shows the Service Mesh screen filtered for services in the default namespace.

Figure 2.6 Service Mesh UI with Filters.

Config Management
Anthos Configuration Management, explored in depth in Chapter 12, is Anthos’ method of automatically adding and maintaining resources on a Kubernetes cluster. These resources can include most common Kubernetes core objects (such as Pods, Services, ServiceAccounts, etc.) as well as custom entities such as policies and cloud-configuration objects. This tab displays the list of all clusters in the current Fleet, their sync status, and which revision is currently enforced on the cluster. The table also shows whether Policy Controller[9] has been enabled for the cluster.

Figure 2.7 Clusters in Config Management View

Selecting a specific cluster opens up the config management cluster detail as shown in Figure 2.8. This detailed view gives further information about the configuration settings, including the location of the repo used, the cycle for syncing, and the version of ACM running on the cluster. Figure 2.8 Cluster Detail in Configuration Management View

Clusters
The Clusters menu lists all clusters in the current Fleet, along with the location, type, labels, and any warnings associated with each cluster, as shown below in Figure 2.9. By selecting a cluster in the list, a more detailed view of

the cluster, with the current Kubernetes version, the CPU and memory available, and the features enabled, will be displayed in the right sidebar as shown in Figure 2.10. Below the sidebar information, a “Manage Features” button will take you to the Features tab for that cluster. In Figure 2.9, the following clusters have been created in the project:
GKE (cluster-gcp)
Bare metal (cluster-1)
Azure AKS (azure-cluster and externalazure)

Figure 2.9 List view in the Clusters Menu

Figure 2.10 Cluster Detail sidebar in the Clusters Menu

Features
The Anthos service encompasses several features (covered in more detail in other chapters), including:
Configuration Management
Ingress
Binary Authorization
Cloud Run for Anthos
Service Mesh
The Features menu provides an easy way to enable and disable specific services for the entire Fleet. Figure 2.11 shows the list of existing features for every cluster.

Figure 2.11 Feature Menu
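Most of these features can also be toggled from the command line; the following sketch shows the general shape of the fleet feature commands (names and flags vary between gcloud releases, so treat them as illustrative):

# Enable Anthos Config Management and Anthos Service Mesh at the fleet level
gcloud container fleet config-management enable --project=[PROJECT_ID]
gcloud container fleet mesh enable --project=[PROJECT_ID]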

An admin also has the ability to disable or enable most of these features from the interface (some features are integral components of Anthos and cannot be disabled). As the sketch above suggests, the same possibility also exists through gcloud or the Fleet Management API for better automation. It’s worth noting that if enablement is not fully possible through the visual interface, the Console generates the right commands for the admin to seamlessly enter into their CLI.

Migrate to containers
One of the major benefits of Anthos is the automatable migration of Windows and Linux VMs to containers and their deployment onto a compatible Anthos cluster. Previously, this was primarily done via the CLI and initiated from the source cluster, but this menu now provides a convenient, centralized process for shifting VMs to containers and into a different deployment scheme. This menu contains tabs for viewing and managing your Migrations, Sources, and Processing Clusters. For more

information on the process of migrating your existing VMs to containers, see Chapter 15: Migrate for Anthos.

Security
The Security menu is where we find multiple tools related to viewing, enabling, and auditing the security posture of the clusters in the current Fleet. Figure 2.12 shows the basic view when the Security menu is first selected.

Figure 2.12 Security Menu

As you can see, we do not currently have Binary Authorization[10] enabled, but Anthos provides us a shortcut here to quickly turn it on. Once we do, we are presented with the configuration page for Binary Authorization (Figure 2.13) enabling us to view and edit the policy, if needed. Figure 2.13 Binary Authorization Policy Details

2.5 Monitoring and Logging
The Anthos menu in the GCP Console is only part of the solution, however. Google also provides the Operations suite, including Cloud Monitoring and Cloud Logging, in order to help with managing the operations of applications and infrastructure. Anthos simplifies the logging of application data and metrics to the Operations suite as part of the default deployment. This can make it simple to add SLOs and SLAs based on these metrics[11]. In addition, several pages within the Anthos menu include shortcuts and buttons that trigger wizards to create SLOs in a guided fashion.

2.6 GKE Dashboard

Google has provided the GKE Dashboard for several years to assist with viewing and managing your GKE clusters in GCP. With the release of Anthos, the GKE Dashboard has been expanded to be able to view the details of Kubernetes clusters attached via GKE Connect. While the Anthos menu is focused on the clusters at a high level and on the Anthos-specific features, such as the Service Mesh and Configuration Management, the GKE Dashboard allows an admin to drill down to specific workloads and services. The next section is a simple tutorial for registering an Azure AKS cluster into the Anthos dashboard.

2.7 Connecting to a Remote cluster
In this example, a cluster has already been created in Azure Kubernetes Service (AKS). Google supports several cluster types that can be registered remotely, referred to as Attached Clusters[12]. To attach these clusters, you will need to take a few more steps.

Step 1: Open a terminal window on a computer that has access to the cluster to be registered. Note the full path to the kubeconfig file used to connect to the cluster.
Step 2: In the Google Console, under the IAM section, create a Google service account with the role "GKE Connect Agent". Generate an account key and save it.
Step 3: Decide on the official designation for the cluster in your Anthos project; this is the Membership Name.
Step 4: Use the command below to register your cluster, replacing the fields with appropriate information:

gcloud container fleet memberships register [MEMBERSHIP_NAME] \
  --context=[CLUSTER_CONTEXT] \
  --kubeconfig=[KUBECONFIG_PATH] \
  --service-account-key-file=[SERVICE_ACCOUNT_KEY_FILE]

After unzipping the download.zip file, the relevant certs can be found in the certs/lin folder. The file with the .0 suffix is the root certificate. Rename it to vcenter.crt and reference it from the installation config file. The vCenter and BigIP F5 credentials are saved in plain text in the config file when creating new user clusters or on installation. One way to secure the F5 credentials is to use a wrapper around Google Cloud Secret Manager and gcloud. To create a password secured by Google Secret Manager:

echo "vcenterp455w0rd" | gcloud secrets create vcenterpass --data-file=- --r

To retrieve a password secured by Google Secret Manager:
gcloud secrets versions access latest --secret="vcenterpass"
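Putting the two commands together, a wrapper might look like the following sketch; it assumes the config file carries a literal placeholder token (here called VCENTER_PASSWORD_PLACEHOLDER) rather than the real password:

# Sketch only: substitute the secret into a temporary copy of the config, use it, then remove it
VCENTER_PASS=$(gcloud secrets versions access latest --secret="vcenterpass")
sed "s|VCENTER_PASSWORD_PLACEHOLDER|${VCENTER_PASS}|" config-template.yaml > config.yaml
gkectl check-config --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config config.yaml
rm config.yaml   # avoid leaving the plain-text password on disk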

This secret is now protected by Google IAM policies, and a wrapper script like the sketch above can retrieve the secret, replace the placeholder in the config file, apply it, and then delete the temporary file. The process to create Anthos cluster components is quickly evolving, and it’s not uncommon for a newer version to have some changes to the config file. You can find the latest release procedures at https://cloud.google.com/anthos/clusters/docs/on-prem/1.9/how-to/create-admin-workstation

Cluster Management: Creating a new user cluster
The gkectl command is used for this operation. As a rule of thumb, admins should constrain their setups to a ratio of one admin cluster to ten user clusters. User clusters should have a minimum of 3 nodes and a maximum of 100 nodes. As previously mentioned, newer releases may increase these numbers. When a new Anthos

release is published, you can check the new limits in the Quotas and Limits section of the respective release. Some general advice:

Leave room for at least one extra cluster in your on-prem environment. This gives the operations team space to recreate clusters and move pods over when upgrading or triaging.
Keep good documentation, such as which IP addresses have already been assigned to other user clusters, so that non-overlapping IPs can be determined easily. Take into consideration that user clusters can be resized to 100 nodes, so reserve up to 100 IP addresses per range to keep that possibility open.
Source-control your configuration files, but do not commit the vSphere username and password. Committing such sensitive information to repositories opens up security risks, as anyone with access to the repository will be able to get those login details. Tools like ytt can be used to template configuration YAMLs, and code reviews and repository scanners should be used to prevent such mistakes from taking place (e.g. https://github.com/UKHomeOffice/repo-security-scanner).
Node pools can be created with different machine shapes, so size them correctly to accommodate your workloads. This also gives granular control over which machine types to scale, and saves costs.
For production workloads, use three replicas for the user cluster master nodes for high availability; for dev, one should be fine.
Validate the configuration file to make sure it is valid. The checks are both syntactic and programmatic, such as checking for IP range clashes and IP availability, using the gkectl check-config command.

gkectl check-config --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [CONFIG_FILE]

Once the configuration has passed the first few validations, the most time-consuming validations can be skipped on subsequent runs by passing the --fast flag. Next, the Seesaw load balancer should be created if the bundled load balancer option is chosen. If you do not create the Seesaw node(s) before attempting a cluster

build that has been configured with the integrated load-balancer option, you will receive an error during the cluster pre-check. To create the Seesaw node(s), use the gkectl create loadbalancer command:
gkectl create loadbalancer --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [CONFIG_FILE]

After the load balancer has been created (for the bundled Seesaw option), the user can then create the user cluster.

gkectl create cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [CONFIG_FILE]

You can also add the --skip-validation-all flag if the config file has already been validated. The whole user cluster process can take 20-30 minutes depending on the hardware; it consists of starting up new VMware virtual machines with the master and worker node images and joining them into a cluster. The administrator is also able to watch the nodes being created from the VMware vCenter console.

High availability setup
High availability is necessary for Anthos deployments in production environments. This is important as failures can occur at different parts of the stack, ranging from networking, to hardware, to the virtualization layer. High availability for admin clusters makes use of vSphere High Availability in a vSphere cluster setup to protect GKE on-prem clusters from going down in the event of a host failure. This ensures that admin cluster nodes are distributed among different physical nodes in a vSphere cluster, so that in the event of a physical node failure, the admin cluster will still be available. To enable HA user control planes, simply specify usercluster.master.replicas: 3 in the GKE on-prem configuration file. This will create three user cluster masters for each user cluster, consuming three times the resources but providing a highly available Kubernetes setup.
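For reference, the corresponding fragment of the configuration file might look roughly like the sketch below; the key names follow the path given above and differ between GKE on-prem config versions, so check the reference for your release:

usercluster:
  master:
    replicas: 3   # three control-plane (master) nodes for an HA user cluster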

Cluster Management: Scaling
Administrators are able to use the gkectl CLI to scale nodes up or down, as seen below. They change the config file to set the number of expected replicas and execute the following command to update the node pool.

gkectl update cluster --kubeconfig [USER_CLUSTER_KUBECONFIG] --config [CONFIG_FILE]
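As an illustration, the node pool section of the user cluster configuration might look like the following sketch (field names follow the nodePools schema used by recent GKE on-prem versions and may vary by release):

nodePools:
- name: pool-1
  cpus: 4
  memoryMB: 8192
  replicas: 5   # desired number of worker nodes after the update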

Cluster Management: Upgrading Anthos
Like any upgrade process, there can be failures along the way. A lot of effort has been put into making the upgrade process robust, including the addition of pre-checks before executing the upgrade to catch potential issues before they occur. The product teams at Google work closely together when an upgrade is being developed to avoid any potential incompatibilities between components like Kubernetes, ACM, and ASM. For ease of access, bookmark this link: https://cloud.google.com/anthos/docs/version-and-upgrade-support. Anthos versions move quite fast due to industry demand for new features, which means that upgrading Anthos is a very common activity. Upgrading Anthos can also mean upgrading to a new version of Kubernetes, which in turn impacts Anthos Service Mesh due to Istio's dependency on Kubernetes. This means the upgrade chain is complex, which is why the earlier recommendation is to keep some spare hardware resources that can be used to create clusters on the new Anthos version and then move workloads to the new cluster before tearing down the older-version cluster. This process reduces the risk associated with upgrades by providing an easy rollback path in case of a failed upgrade. In this type of upgrade path, there should be a load balancer in front of the microservices running in the old cluster that can direct traffic from the old cluster to the new cluster, as they will exist at the same time. However, if this is not an option, administrators are able to upgrade Anthos clusters in place. First, consult the upgrade paths. From GKE on-prem 1.3.2 onwards,

administrators are able to upgrade directly to any version in the same minor release; otherwise, sequential upgrades are required. From version 1.7, administrators can keep their admin cluster on an older version while only upgrading the admin workstation and the user clusters. As a best practice, administrators should still schedule the admin cluster upgrades to keep up to date. Next, download the gkeadm tool, which has to match the target version of your upgrade, and run gkeadm to upgrade the admin workstation, then gkectl to upgrade your user clusters and, finally, the admin cluster. When upgrading in place, a new node is created with the image of the latest version, and workloads are drained from the older version and shifted over to the latest version, one node after the other. This means administrators should plan for additional resources in their physical hosts to accommodate at least one extra user node for upgrade purposes. The full flow can be seen in Figure 5.11.

Figure 5.11 Upgrading flow
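In outline, the in-place upgrade sequence looks like the following sketch; the flags are abbreviated, so consult the documentation linked below for the exact, version-specific commands:

# 1. Upgrade the admin workstation to the target version
gkeadm upgrade admin-workstation --config [ADMIN_WS_CONFIG_FILE]
# 2. Upgrade each user cluster
gkectl upgrade cluster --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [USER_CLUSTER_CONFIG_FILE]
# 3. Upgrade the admin cluster
gkectl upgrade admin --kubeconfig [ADMIN_CLUSTER_KUBECONFIG] --config [ADMIN_CLUSTER_CONFIG_FILE]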

For a detailed list of commands, please consult the upgrading documentation at https://cloud.google.com/anthos/gke/docs/on-prem/how-to/upgrading#upgrading_your_admin_workstation for exact details.

Cluster Management: Backing up clusters
Anthos admin clusters can be backed up by following the steps found at https://cloud.google.com/anthos/clusters/docs/on-prem/1.8/how-to/back-up-and-restore-an-admin-cluster-with-gkectl. As part of a production Anthos environment setup, it is recommended to schedule regular backups and to take on-demand backups when upgrading Anthos versions. The etcd of Anthos user clusters can be backed up by running a backup script; you can read more about backing up a cluster on the Anthos documentation page at https://cloud.google.com/anthos/gke/docs/on-prem/how-to/backing-

up. Do note that this only backs up the etcd of the clusters, which means the Kubernetes configuration. Google also states this to be a last resort. Backup for GKE promises to make this simpler, and we look forward to similar functionality for Anthos clusters soon (https://cloud.google.com/blog/products/storage-data-transfer/google-cloud-launches-backups-for-gke). Any application-specific data, such as persistent volumes, is not backed up by this process. Those should be backed up regularly to another storage device using one of a number of different tools, like Velero. You should treat your cluster backups the same as any data that is backed up from a server. The recommendation is to practice restoring an admin and user cluster from backup, along with application-specific data, to gain confidence in the backup and recovery process. Google has a number of additions in development for Anthos; one important feature, to be named Anthos Enterprise Data Protection, will provide the functionality to back up cluster-wide config such as custom resource definitions, namespace-wide configuration, and application data from the Google Cloud Console into a cloud storage bucket, and to restore from that backup as well.
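For the application data itself, a tool such as Velero (mentioned above) can back up namespaces, including persistent volume data, and restore them later; a minimal illustration with hypothetical names:

# Back up everything in the my-app namespace (names are illustrative)
velero backup create my-app-backup --include-namespaces my-app
# Restore from that backup later, for example on a freshly rebuilt cluster
velero restore create --from-backup my-app-backup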

5.2.2 GKE on AWS
GKE on AWS uses AWS EC2 instances and other components to build GKE clusters, which means these are not EKS clusters. If a user logs into the AWS console, they will only see the admin cluster and user cluster nodes as individual AWS EC2 instances. It is important to differentiate this from managing EKS clusters attached to Anthos, as the responsibilities assigned to the cloud providers differ according to cluster type. GKE on AWS installation is done via the gcloud CLI, with the command gcloud container aws clusters create. For Terraform users, there is also sample Terraform code to install GKE on AWS with Anthos in this repository: https://github.com/GoogleCloudPlatform/anthos-samples/tree/main/anthos-

multi-cloud/AWS. This will further simplify the installation process and remove the need for a bastion host and management server mentioned in the steps below. The installation process is to first get an AWS KMS key, then use anthos-gke which in turn uses Terraform to generate Terraform code. Terraform is an Infrastructure as Code open source tool to define a target state of a computing environment by Hashicorp. Terraform code is declarative and utilizes Terraform providers which are often contributed by cloud providers such as Google, AWS, Microsoft to map their cloud provisioning APIs to Terraform code. The resulting Terraform code describes how the GKE on AWS infrastructure will look like. It has components which are analogous to GKE on-prem, such as a LoadBalancer, EC2 Virtual Machines, but leverage the Terraform AWS Provider work to instantiate the infrastructure on AWS. You can learn more about Terraform at https://www.terraform.io/. The architecture of GKE on AWS can be seen on Figure 5.12 below which is from the Google Cloud documentation at https://cloud.google.com/anthos/gke/docs/aws/concepts/architecture. Figure 5.12 GKE on AWS architecture

The use of node pools is similar to GKE, with the ability to have different machine sizes within a cluster.

Note:

To do any GKE on AWS operations management, the administrator will have to log into the bastion host which is part of the management service.

Connecting to the management service

When doing any management operations, the administrator needs to connect to the bastion host deployed during the initial installation of the management service. The connection script, named bastion-tunnel.sh, is generated by Terraform during the management service installation.

Cluster Management: Creating a new user cluster

Use the bastion-tunnel script to connect to the management service. After connecting to the bastion host, the administrator uses Terraform to generate a manifest that configures an example cluster in a YAML file:

terraform output cluster_example > cluster-0.yaml

In this YAML file, the administrator then changes the AWSCluster and AWSNodePool specifications. Be sure to save the cluster file to a code repository; it will be reused for scaling the user cluster. Custom Resources are extensions of Kubernetes that add functionality, such as, in this case, provisioning AWS EC2 instances. AWS clusters and related objects are represented as YAML referencing the AWSCluster and AWSNodePool Custom Resources in the management service cluster, which interprets this YAML and adjusts resources in AWS accordingly. To read more about Custom Resources, please refer to https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/.

Cluster Management: Scaling

You may experience a situation where a cluster requires additional compute power and you need to scale the cluster out. Luckily, an Anthos node pool has an option to scale a cluster, including a minimum and maximum node count. If you created a cluster with the same count for both the minimum and maximum nodes, you can change it at a later date to grow your cluster. To

scale a cluster for GKE on AWS, the administrator simply modifies the YAML file used when creating the user cluster, updating the minNodeCount value, and applies it to the management service (a sketch of applying the change through the bastion tunnel follows the listing).

apiVersion: multicloud.cluster.gke.io/v1
kind: AWSNodePool
metadata:
  name: cluster-0-pool-0
spec:
  clusterName: cluster-0
  version: 1.20.10-gke.600
  minNodeCount: 3
  maxNodeCount: 10
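To actually apply cluster manifests like the one above, the administrator works through the bastion tunnel to reach the management service. A minimal sketch follows; the -N flag and the local proxy port 8118 are assumptions based on the typical generated tunnel script, and the file name matches our earlier example.

# Start the tunnel to the management service in the background
./bastion-tunnel.sh -N &

# Apply the edited cluster or node pool manifest through the tunnel
env HTTPS_PROXY=http://localhost:8118 kubectl apply -f cluster-0.yaml

# Watch the AWS cluster and node pool resources reconcile
env HTTPS_PROXY=http://localhost:8118 kubectl get awsclusters,awsnodepools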

Cluster Management: Upgrading

Upgrading GKE on AWS is done in two steps: the management service is handled first, and then the user clusters. To upgrade a GKE on AWS management service, the administrator must work from the directory containing the GKE on AWS configuration. First, download the latest version of the anthos-gke binary. Next, modify the anthos-gke.yaml file to set the target version.

apiVersion: multicloud.cluster.gke.io/v1
kind: AWSManagementService
metadata:
  name: management
spec:
  version:

Finally, validate and apply the version change by running:

anthos-gke aws management init
anthos-gke aws management apply

While the upgrade runs, the management service is down, so no changes to user clusters can be applied; the user clusters, however, continue to run their workloads.

To upgrade the user cluster, the administrator switches kubectl context to the management service from the GKE on AWS directory using the following command:

anthos-gke aws management get-credentials

Then upgrading the version of the user cluster is as simple as running:

kubectl edit awscluster

And then editing the YAML to point to the target GKE version.

apiVersion: multicloud.cluster.gke.io/v1
kind: AWSCluster
metadata:
  name: cluster-0
spec:
  region: us-east-1
  controlPlane:
    version:

On submission of this change, the controller behind the AWSCluster custom resource goes through the control plane nodes one by one and upgrades them to the requested GKE on AWS version. This upgrade process causes downtime of the control plane, which means the cluster may be unable to report the status of the different node pools until it is completed. The last step is to upgrade the actual node pool. The same procedure applies: the administrator simply edits the YAML to the required version and applies it to the management service.

apiVersion: multicloud.cluster.gke.io/v1
kind: AWSNodePool
metadata:
  name: cluster-0-pool-0
spec:
  clusterName: cluster-0
  region: us-east-1
  version:

5.3 Anthos attached clusters

Anthos attached clusters are conformant Kubernetes clusters that are provisioned and managed elsewhere, for example by AWS Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), or any other conformant Kubernetes distribution. In this case, the scaling and provisioning of the clusters are done from the respective clouds. However, these clusters can still be attached to and managed by Anthos by registering them to Google Cloud through deployment of the Connect agent, as seen in Figure 5.13. Google Kubernetes Engine is handled in the same way and can be attached from another project into the Anthos project.

Figure 5.13 Adding an external cluster (Bring your own Kubernetes)
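As an illustration of the registration step, attaching an existing cluster typically comes down to a single gcloud command; the membership name, kubeconfig path, context name, and service account key below are placeholders, and newer gcloud releases expose the same operation under gcloud container fleet memberships register.

gcloud container hub memberships register my-attached-cluster \
    --kubeconfig=/path/to/kubeconfig \
    --context=my-eks-context \
    --service-account-key-file=/path/to/connect-sa-key.json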

1. The administrator has to generate a kubeconfig for the EKS or AKS cluster and then provide that kubeconfig in a generated cluster registration command in gcloud. Please consult documentation from AWS and Azure on how to generate a kubeconfig file for EKS or AKS clusters. The administrator is also able to generate one manually using the template below, providing the necessary certificate, server info, and service account token.

apiVersion: v1
kind: Config
users:
- name: svcs-acct-dply
  user:
    token:
clusters:
- cluster:
    certificate-authority-data:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8simagedigests
  annotations:
    description: >-
      Requires container images to contain a digest.
      https://kubernetes.io/docs/concepts/containers/images/
spec:
  crd:
    spec:
      names:
        kind: K8sImageDigests
      validation:
        openAPIV3Schema:
          type: object
          description: >-
            Requires container images to contain a digest.
            https://kubernetes.io/docs/concepts/containers/images/
          properties:
            exemptImages:
              description: >-
                Any container that uses an image that matches an entry in this list
                will be excluded from enforcement. Prefix-matching can be signified
                with `*`. It is recommended that users use the fully-qualified Docker
                image name (e.g. start with a domain name) in order to avoid
                unexpectedly exempting images from an untrusted repository.
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8simagedigests

        import data.lib.exempt_container.is_exempt

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not is_exempt(container)
          satisfied := [re_match("@[a-z0-9]+([+._-][a-z0-9]+)*:[a-zA-Z0-9=_-]+", container.image)]
          not all(satisfied)
          msg := sprintf("container <%v> uses an image without a digest <%v>", [container.name, container.image])
        }
        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          not is_exempt(container)
          satisfied := [re_match("@[a-z0-9]+([+._-][a-z0-9]+)*:[a-zA-Z0-9=_-]+", container.image)]
          not all(satisfied)
          msg := sprintf("initContainer <%v> uses an image without a digest <%v>", [container.name, container.image])
        }
        violation[{"msg": msg}] {
          container := input.review.object.spec.ephemeralContainers[_]
          not is_exempt(container)
          satisfied := [re_match("@[a-z0-9]+([+._-][a-z0-9]+)*:[a-zA-Z0-9=_-]+", container.image)]
          not all(satisfied)
          msg := sprintf("ephemeralContainer <%v> uses an image without a digest <%v>", [container.name, container.image])
        }
      libs:
        - |
          package lib.exempt_container

          is_exempt(container) {
            exempt_images := object.get(object.get(input, "parameters", {}), "exemptImages", [])
            img := container.image
            exemption := exempt_images[_]
            _matches_exemption(img, exemption)
          }

          _matches_exemption(img, exemption) {
            not endswith(exemption, "*")
            exemption == img
          }

          _matches_exemption(img, exemption) {
            endswith(exemption, "*")
            prefix := trim_suffix(exemption, "*")
            startswith(img, prefix)
          }

It’s important to note that the Rego code contains multiple violation sections. At first glance, it may appear that the code is the same for each, but on closer inspection, you will notice one minor difference on the container := line. The first violation block checks all containers for a digest, the second checks all initContainers, and the third checks any ephemeralContainers. Since they are all unique objects, we need to

include each object in our code, or it will not be checked by the policy engine. Finally, to activate the constraint, we apply a manifest that uses the new custom resource created by the above template, K8sImageDigests.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sImageDigests
metadata:
  name: container-image-must-have-digest
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

Once applied to the cluster, any new pod request that does not supply a digest will be denied by the admission controller.

13.4.2 Using Binary Authorization to Secure the Supply Chain

Since the SolarWinds security breach, there has been a spotlight on how you need to secure your software supply chain. This is something that should have always been considered and implemented, but it often takes an event like the SolarWinds breach to capture the attention of the public. Securing the supply chain is a large topic, and to give it the coverage it deserves would require a dedicated chapter, but we wanted to provide an overview of the tools Google provides to help you secure your supply chain. You may have recently heard the term "shifting left on security". This term refers to the practice of considering security earlier in the software development process. There are a number of topics to consider when shifting left, and if you want to read an independent report that was sponsored by companies including Google, CloudBees, Deloitte, and more, read the State of DevOps report from 2019, which covers key findings from multiple companies and their DevOps practices, located at https://cloud.google.com/devops/state-of-devops. Anthos includes a powerful tool that centralizes software supply chain security for workloads on both Anthos on GCP and Anthos on-prem, called

Binary Authorization (BinAuth). At a high level, BinAuth adds security to your clusters by requiring a trusted authority signature on your deployed images, which is verified (attested) when a container is deployed. If the deployed container does not carry a signature that matches the trusted authority, it will be denied scheduling and fail to deploy. Google's BinAuth provides a number of features, including:

Policy creation
Policy enforcement and verification
Cloud Security Command Center integration
Audit logging
Cloud KMS support
Signature verification based on the open source tool Kritis
Dry run support
Breakglass support
Third-party support, including support for Twistlock, Terraform, and CloudBees

Along with the features provided, you can integrate BinAuth with Google's Cloud Build and Container Registry scanning, allowing you to secure your supply chain based on build metadata and vulnerability scans. Google has a number of integration docs that will step you through integrating BinAuth with systems like CircleCI, Black Duck, Terraform, and Cloud Build on the Binary Authorization page, located at https://cloud.google.com/binary-authorization/.
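As a hedged illustration of what working with a Binary Authorization policy looks like, the policy can be exported, edited, and re-imported with gcloud; the project and attestor names below are placeholders, and the exact field values for your environment should come from the Binary Authorization documentation.

# Export the current policy, edit it, and import it back
gcloud container binauthz policy export > binauthz-policy.yaml
gcloud container binauthz policy import binauthz-policy.yaml

An illustrative exported policy requiring attestation might look like this:

defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/my-project/attestors/my-attestor
globalPolicyEvaluationMode: ENABLE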

13.4.3 Using Gatekeeper to Replace PSPs

As Kubernetes works on deprecating PSPs, you may want to start moving away from using PSPs to secure your clusters. One way to move away from PSPs as your main security mechanism is to migrate to Gatekeeper policies instead. The Gatekeeper project has a GitHub repository dedicated to policies that are designed to replace PSPs at https://github.com/open-policy-agent/gatekeeper-library/tree/master/library/pod-security-policy.

In the next section, we will close out the chapter by learning about securing your images using Google Container scanner.

13.5 Understanding Container Vulnerability Scanning

Like any standard operating system or application, containers may contain binaries that have known vulnerabilities. To keep your cluster secure, you need to verify the integrity of your containers by continuously scanning each one. There are many solutions on the market today to scan containers for vulnerabilities, including Aqua Security, Twistlock, Harbor, and Google's Container Registry. Each of these tools offers different levels of scanning ability and, in most cases, an additional cost. At a minimum, you will want to scan your images for any vulnerabilities from the Common Vulnerabilities and Exposures (CVE) list. The CVE list[144] is a publicly disclosed list of security vulnerabilities for various software components, including operating systems and libraries. Entries in the list contain only a brief overview of the vulnerability; they do not contain detailed information like impact, risks, or how to remediate the issue. To retrieve the details for a CVE, each entry has a link that will take you to the National Vulnerability Database (NVD), which provides additional details about the CVE, including a description, severity, references, and a change history. While Anthos does not include a vulnerability scanner, Google does provide scanning if you store your images in Google's Container Registry. In this section we will explain how to enable scanning on your repository and how to view the scanning results.

13.5.1 Enabling Container Scanning

The first requirement to enable scanning in your registry is to enable two APIs on your GCP project: the Container Analysis API and the Container Scanning API. The Container Analysis API enables metadata storage in your project and is free, while the Container Scanning API enables vulnerability scanning and is charged per scanned image. You can view the pricing details for the scanning API at https://cloud.google.com/container-registry/pricing. To enable the required APIs using the gcloud CLI, follow the steps below:

1. Set your default project. Our example uses a project called test1-236415.

gcloud config set project test1-236415

2. Next, enable the Container Analysis API.

gcloud services enable containeranalysis.googleapis.com

3. Finally, enable the Container Scanning API.

gcloud services enable containerscanning.googleapis.com

Once the APIs have been enabled on the project, you will need to create a repository to store your images. The example below creates a Docker repository called docker-registry in the us-east4 location with a description of "Docker Registry".

gcloud artifacts repositories create docker-registry \
    --repository-format=docker --location=us-east4 \
    --description="Docker Registry"

In order to push images to your repository, you need to configure Docker on your client to use your GCP credentials. Authentication to repositories in GCP is configured on a per-region basis. In the previous step, we created a repository in the us-east4 location, so to configure authentication we would execute the gcloud command below:

gcloud auth configure-docker us-east4-docker.pkg.dev

Now that your registry and Docker have been configured, you can start to use your registry to store images. In the next section, we will explain how to tag images and push them to your new repository.

13.5.2 Adding Images to your Repository

Adding an image to a GCP registry follows the same steps that you would use for any other Docker registry, but the tag may be different from what you are used to. To add an image to your registry, you must:

Build a new image using Docker or pull the image from another registry, if you do not have the image locally
Tag the image with your GCP registry
Push the new image to the registry

For example, to add a CentOS 8 image to a registry, follow the steps below:

1. Download the CentOS 8 image from Docker Hub.

docker pull centos:8

2. Next, tag the newly pulled image with the Google registry information. When you tag an image that will be stored in a GCP registry, you must follow a specific naming convention: the image tag uses the form LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE. In the example, the location is us-east4, the project is test1-236415, and the repository is named docker-registry.

docker tag centos:8 us-east4-docker.pkg.dev/test1-236415/docker-registry/centos:8

3. Finally, push the new image to the registry.

docker push us-east4-docker.pkg.dev/test1-236415/docker-registry/centos:8

In the next section, we will explain how to look at your images and any vulnerabilities that have been found in them.

13.5.3 Reviewing Image Vulnerabilities

Since our project has the required APIs enabled, each image will be scanned when it is pushed to the registry. To review the vulnerabilities, open the GCP console, click Artifact Registry, and then click Repositories under the tools section.

Figure 13.4 Navigating to your Registries

This will bring up all the registries in your project. Continuing with our example, we created a registry called docker-registry, as shown in the image below.

Figure 13.5 Project Registries

Open the repository that you pushed the image to in order to view its images. Previously, we pushed the CentOS image to our registry.

Figure 13.6 Images List

Clicking the image you want more details for will show you the digests for the image and the number of vulnerabilities that each digest contains, as shown in our example in Figure 13.7.

Figure 13.7 Image Hash List

To view each of the vulnerabilities, click on the number in the vulnerabilities column, which will open a new window listing all CVEs for the image. Depending on the image and the scan results, you may see different links or options for the CVEs. Using our CentOS image example, we can see that the results have a link to view fixes for each CVE.

Figure 13.8 CVE Example list with Fixes

In another example, an Ubuntu image, there are no fixes listed in the CVEs, so the results screen will contain different options.

Figure 13.9 CVE Example without Fixes

You can view additional details for each CVE by clicking on the CVE in the name column, or you can click on the VIEW link on the far right-hand side. Clicking the CVE name will take you to the vendor's site, while clicking the view link will provide additional details about the vulnerability. In this section we introduced Google's container registry scanning, how to enable it, and how to view the scanning results. This was only an introduction to the service; you can expand the functionality by integrating with Pub/Sub, adding access controls, and more. To see additional documentation, you can visit Google's how-to guides at https://cloud.google.com/container-analysis/docs/how-to.
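You can also inspect scan results from the command line. A rough sketch, using the repository and image from our example, is shown below; flags and output formats may vary with gcloud versions.

# List the images in the repository
gcloud artifacts docker images list us-east4-docker.pkg.dev/test1-236415/docker-registry

# Show vulnerability findings for a specific image
gcloud artifacts docker images describe \
    us-east4-docker.pkg.dev/test1-236415/docker-registry/centos:8 \
    --show-package-vulnerability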

13.6 Understanding Container Security

There are two main concepts to consider when you are creating a security policy: the user the container will run as, and whether the container can run in privileged mode. Both of these ultimately decide what access, if any, a potential container breakout will have on the host. When a container is started, it will run as the user that was set at the time of image creation, which is often the root user. However, even if you run a container as root, it doesn't mean that the processes inside the container will have root access on the worker node, since the Docker daemon itself will restrict host-level access, depending on the policy regarding privileged containers. To help explain this, reference the table below for each setting and the resulting permissions.

Table 13.3 Root and privileged container permissions

Running Container User | Privileged Value | Host Permissions
Running as root        | False            | None
Running as root        | True             | Root access
Running as non-root    | False            | None
Running as non-root    | True             | Limited, only permissions that have been granted to the same user on the host system

Both of these values determine what permissions a running container will be granted on the host. Simply running an image as root does not allow that container to run as root on the host itself. To explain the impact in greater detail, we will show what happens when you run a container as root, and how allowing users to deploy privileged containers can allow someone to take over the host.

13.6.1 Running Containers as Root

Over the years, container security has received a somewhat bad reputation. Many of the examples that have been used as evidence for this are in fact not container issues, but configuration issues on the cluster. Not too long ago, many developers created new images running as root, rather than creating a new user and running as that user to limit any security impact. This is a good time to mention that if you commonly download images from third-party registries, you should always run them in a sandboxed environment before using them in production. You don't know how the image was created, who it runs as, or whether it contains any malicious code, so always inspect images before running them in production. In the last section of this chapter, we will cover Google's container scanning, which will scan your images for known security concerns. There are multiple tools that you can use to limit deployments of malicious containers, including:

Container scanning, included in the Google Container Registry with scanning enabled
Allowing only trusted container repositories, either internal or trusted partner registries
Requiring images to be signed

One of the most dangerous, and commonly overlooked, security concerns is allowing a container to run as root. To explain why this is a bad practice, let's use a virtual machine example: would you allow an application to run as root, or as administrator? Of course you wouldn't. If you had a web server running its processes as an administrator, any application breakout would be granted the permissions of the user that was running the process. In this case, that would be an account with root or administrator privileges, providing full access to the entire system. To mitigate any issues from a breakout, all applications should be run with their least required set of permissions. Unfortunately, it is far too common for developers to run their containers as root. By running a container as root, any container breakout would grant the intruder access to any resources on the

host. Many images on Docker Hub or GitHub are distributed using root as the default user, including the common `busybox` image. To avoid running an image as root, you need to create and set a user account in your image, or supply a user account when you start the container. Since `busybox` is normally pulled from Docker Hub, we can run it with a non-root account by configuring a security context in the deployment. As part of a pod definition, the container can be forced to run as a user by adding the securityContext field, which allows you to set the context for the user, group, and fsGroup.

spec:
  securityContext:
    runAsUser: 1500
    runAsGroup: 1000
    fsGroup: 1200

Deploying the image with the additional securityContext will execute the container as user 1500. We also set the group to 1000 and the fsGroup to 1200. We can confirm all these values using the `whoami` and `groups` commands, as shown in Figure 13.10 below.

Figure 13.10 A pod running as the root user, and using securityContext to change the defined user and user groups.
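If you want to try this end to end, a minimal sketch of a complete Pod using that security context might look like the following; the pod name is arbitrary, and busybox is pulled from Docker Hub as in the text.

apiVersion: v1
kind: Pod
metadata:
  name: nonroot-demo
spec:
  securityContext:
    runAsUser: 1500
    runAsGroup: 1000
    fsGroup: 1200
  containers:
  - name: nonroot-demo
    image: busybox
    command: ["sh", "-c", "sleep 1h"]

After the pod is running, kubectl exec nonroot-demo -- id should report uid 1500 and gid 1000 rather than root.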

The UID and group IDs that were used are unknown in the image, since it was pulled from Docker Hub and it only contains the users and groups that were included when the image was created. In an image that you or someone in your organization created, you would have added the required groups during the Docker build and would not receive the unknown ID warnings. In this section we explained how you can set a security context to run an image as a non-root user or group at deployment time. This covers only the first half of securing our hosts from malicious containers; the next section

will explain how privileged containers can impact our security and how they work together to provide access to the host.

13.6.2 Running Privileged Containers

By default, containers will execute without any host privileges. Even when you start a container as root, any attempts to edit any host settings will be denied.

Figure 13.11 Non-privileged container running as root

For example, we can try to set a kernel value from a container that is running as root, but not as a privileged container.

Figure 13.12 Example kernel change from a container without privileges

The kernel change is denied since the running image does not have elevated privileges on the host system. If there were a reason to allow this operation from a container, the image could be started as a privileged container. To run a privileged container, you need to allow it in the securityContext of the pod.

apiVersion: v1
kind: Pod
metadata:
  name: root-demo
spec:
  containers:
  - name: root-demo
    image: busybox
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      privileged: true

Now that the pod has been allowed to run as a privileged container, and it is running as root, it will be allowed to change kernel parameters.

Figure 13.13 Privileged container running as root

In the screenshot below, notice that the domain name change does not return an error, verifying that the container can modify host-level settings.

Figure 13.14 Host Kernel change allowed from a running container

This time, the kernel change worked for two reasons: first, the container is running as the root user, and second, the container was allowed to start as a privileged container. For the last scenario, the manifest has been edited to run as user 1000, who does not have root privileges, and to start as a privileged container.

apiVersion: v1
kind: Pod
metadata:
  name: root-demo
spec:
  containers:
  - name: root-demo
    image: busybox
    command: [ "sh", "-c", "sleep 1h" ]
    securityContext:
      privileged: true
      runAsUser: 1000

Even though the container is running as a privileged container, because it runs as a standard (non-root) user, any kernel change will be denied.

Figure 13.15 Privileged container running as non-root

In summary, the actions that a container can take on the host are controlled by the user running in the container and by whether or not the container is allowed to run as a privileged container. To secure a cluster, you will need to create a policy that defines controls for each of these values. Right now, you know why containers should not be allowed to run as root and why you should limit pods that are allowed to run as privileged containers, but we haven't explained how to stop either of these actions from occurring on a cluster. This is an area where Anthos excels! By providing Anthos Config Management, Google has included all the tools you need to secure your cluster against these and many other common security concerns. In the next section, we will explain how to use ACM to secure a cluster using the included policy engine, Gatekeeper.

13.7 Using ACM to Secure Your Service Mesh

As you have seen throughout this book, Anthos goes beyond simply providing a basic Kubernetes cluster, adding components like Anthos Service Mesh (ASM) for a service mesh, Binary Authorization, serverless workloads, and ACM to handle Infrastructure as Code. In the Config management architecture chapter, you learned about designing and configuring ACM to enforce deployments and objects on an Anthos cluster. In this section, we will use ACM to secure communication between services in a cluster by using a policy. We will then move on to an additional component included with ACM, the Policy Controller, which provides an admission controller based on the open source project Gatekeeper.

Note:

When enabling mTLS using an ACM policy, remember that the policy will be applied to all clusters that are managed by the external repository, unless you use a ClusterSelector to limit the clusters that will be configured.
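As a hedged sketch of that ClusterSelector mechanism, the selector name and cluster labels below are placeholders and assume your registered clusters carry an environment label in ACM:

apiVersion: configmanagement.gke.io/v1
kind: ClusterSelector
metadata:
  name: prod-clusters
spec:
  selector:
    matchLabels:
      environment: prod

A resource in the repository can then be limited to the matching clusters by annotating it with configmanagement.gke.io/cluster-selector: prod-clusters.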

13.7.1 Using ACM to Enforce Mutual TLS

In the chapter Anthos Service Mesh: security and observability at scale, you

learned that ASM includes the ability to encrypt traffic between services using mutual TLS (mTLS). Mutual TLS is the process of verifying service identities, via Istio's sidecar, before allowing communication between the services. Once the identities have been verified, the communication between the services is encrypted. However, by default, Istio is configured to use permissive mTLS. Permissive mTLS allows a workload that does not have a sidecar running to communicate with a sidecar-enabled service using HTTP (plaintext). Developers or administrators that are new to service meshes generally use the permissive setting. While this is beneficial for learning Istio, allowing HTTP traffic to a service running a sidecar makes it insecure, nullifying the advantages of Istio and the sidecar. Once you are comfortable with Istio, you may want to consider changing the permissive policy to the more secure strict setting. You can force strict mTLS for the entire mesh, or just certain namespaces, by creating a Kubernetes object called PeerAuthentication. Deciding on the correct scope for mTLS is different for each organization and cluster. You should always test any mTLS policy changes in a development environment before implementing them in production, to avoid any unexpected application failures. Since this is an important policy, it's a perfect example to demonstrate the importance of using ACM as a configuration management tool. Remember that once an object is managed by ACM, the configuration manager will control it. This means that the manager will re-create any managed object that is edited or deleted for any reason. For the mTLS use case, you should see the importance of using ACM to make sure that the policy is set and, if edited, remediated to the configured strict value. To enable a strict mesh-wide mTLS policy, you need to create a new PeerAuthentication object that sets the mTLS mode to strict. An example manifest is shown below.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT

The manifest assumes that Istio has been installed in the istio-system namespace. Since the object is created in the istio-system namespace, Istio's root namespace, it will enforce a strict mTLS policy for all namespaces in the cluster.

Note:

To enforce a strict mTLS policy for every namespace in the cluster, the PeerAuthentication object must be created in the same namespace that Istio was installed in; by default, this is the istio-system namespace. If you have decided to implement per-namespace enforcement, the manifest requires a single modification: the namespace value. For example, if we wanted to enable mTLS on a namespace called webfront, we would use the manifest below.

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: webfront
spec:
  mtls:
    mode: STRICT

To use either of these manifests with ACM to enforce a strict mTLS mesh policy, you simply need to store them in your ACM repository. Since the policy is stored in the ACM repository, it will be managed by the controller, and any change or deletion will result in the object being recreated using the strict setting. The mTLS policy is just one example of how we can use ACM and ASM together to enforce a security policy for a cluster. In the next section, we will explore a new component that provides additional security to an Anthos cluster, the policy controller.

13.8 Conclusion

ACM's policy engine is a powerful add-on included with all Anthos clusters. Gatekeeper allows an organization to create granular policies to secure a cluster against potential attackers by providing additional security and stability. Google provides a collection of default policies that address some of the most common security concerns collected from the community and Google's own experiences. If the included policy library doesn't address a security issue in your organization, you can create your own policies by using Gatekeeper's policy language, Rego.

13.9 Examples and Case Studies

Using the knowledge from the chapter, address each of the requirements in the case study found below.

13.9.1 Evermore Industries

Evermore Industries has asked you to evaluate the security of their Anthos Kubernetes cluster. The cluster has been configured as outlined below:

Multiple control plane nodes
Multiple worker nodes
ASM to provide Istio, configured with permissive mTLS
ACM configured with the policy engine enabled, including the default template library

They have asked you to document any current security concerns and remediation steps to meet the following requirements:

1. Audit for any security concerns and provide proof of any exploit covered by policies
2. All containers must only be allowed to pull from an approved list of registries, including:
   a. gcr.io
   b. hub.evermore.local

3. All policies, other than the approved registry policy, must be tested to assess the impact before being enforced
4. Containers must deny any privilege escalation attempts, without affecting any Anthos namespaces, including:
   a. kube-system
   b. gke-system
   c. config-management-system
   d. gatekeeper-system
   e. gke-connect
5. Containers must not be able to use hostPID, hostNetwork, or hostIPC in any namespace other than the kube-system namespace
6. All requirements must be addressed using only existing Anthos tools

The next section contains the solution to address Evermore's requirements. You can follow along with the solution, or, if you are comfortable, configure your cluster to address the requirements and use the solution to verify your results.

Evermore Industries Solution - Testing the Current Security

Meets requirements: 1

The first requirement asks you to document any security concerns with the current cluster. To test the first three security requirements, you can deploy a manifest that attempts to elevate the privileges of a container. The test manifest should pull an image from a registry that is not on the approved list and set the fields to elevate privileges and the various host values. We have provided an example manifest below.

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: hack-example
  name: hack-example
spec:
  hostPID: true
  hostIPC: true
  hostNetwork: true
  volumes:
  - name: host-fs
    hostPath:
      path: /
  containers:
  - image: docker.io/busybox
    name: hack-example
    command: ["/bin/sh", "-c", "sleep infinity"]
    securityContext:
      privileged: true
      allowPrivilegeEscalation: true
    volumeMounts:
    - name: host-fs
      mountPath: /host

This manifest will test all the security requirements in a single deployment. The image that is being pulled is from docker.io, which is not in the approved registry list. It also maps the host's root filesystem into the container at the mount /host, and it starts as a privileged container. Since the container started successfully, we can document that the cluster can pull images from registries that are not in the accepted list. A successful start also shows that the pod started as a privileged container and that the mount to hostPath succeeded. To document that the container does have access to the host filesystem, we can exec into the container and list the /host directory. The image below shows that we can successfully list the host's root filesystem.

Figure 13.16 Accessing the Host filesystem in a container

After capturing the output and adding it to the documentation, you can delete the pod, since we will need to test the same deployment with the policies enabled in the next test. You can delete it by executing:

kubectl delete -f usecase1.yaml

Evermore Industries Solution - Adding Repo Constraints

Meets requirement: 2

Evermore's second requirement is that container images can only be pulled from trusted registries. In the requirements, only images pulled from gcr.io and hub.evermore.local are allowed to be deployed in the cluster. To limit images to only the two registries, we need to create a new constraint that uses the k8sallowedrepos.constraints.gatekeeper.sh template. An example constraint is provided below.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: allowed-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    repos:
      - "gcr.io"
      - "hub.evermore.local"

Once this manifest is deployed, any attempts to pull an image from a registry other than gcr.io and hub.evermore.local will result in the admission controller denying the pod creation with an error that an invalid image repo was used.

Error creating: admission webhook "validation.gatekeeper.sh" denied the request ...

Now that we have addressed requirement 2, we can move on to address requirements 3 and 4.

Evermore Industries Solution - Adding Privileged Constraints

Meets requirements: 3 and 4

We need to address the security requirements for Evermore's cluster. To prevent privileged pods from running in the cluster without affecting pods in any Anthos system namespaces, we need to enable a constraint with exemptions. However, before enabling a constraint, Evermore has required that all constraints be tested and the output of affected pods be supplied as part of the documentation. The first step is to create a manifest for the constraint. The manifest shown below will create a constraint called psp-privileged-container in auditing mode only. It will also exclude all of the system namespaces that Evermore has supplied in the requirements document.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container
spec:
  enforcementAction: dryrun
  excludedNamespaces:
    - kube-system
    - gke-system
    - config-management-system
    - gatekeeper-system
    - gke-connect
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

To add the audit output to the documentation, retrieve the constraint and direct the output to a file by executing the kubectl command below:

kubectl get K8sPSPPrivilegedContainer psp-privileged-container -o yaml > privtest

This will create a file called privtest in the current folder, containing the audit results for the psp-privileged-container constraint. You should check the file to verify that it contains the expected audit results under the violations section. Here is an abbreviated output from our audit:

violations:
- enforcementAction: dryrun
  kind: Pod
  message: 'Privileged container is not allowed: cilium-agent, securityContext:
    {"capabilities": {"add": ["NET_ADMIN", "SYS_MODULE"]}, "privileged": true}'
  name: anetd-4qbw5
  namespace: kube-system
- enforcementAction: dryrun
  kind: Pod
  message: 'Privileged container is not allowed: clean-cilium-state, securityContext:
    {"capabilities": {"add": ["NET_ADMIN"]}, "privileged": true}'
  name: anetd-4qbw5

You may have noticed that the audit output contains pods running in namespaces that were added as an exclusion. Remember that when you exclude a namespace in a constraint, the namespace will still be audited - the exclusion only stops the policy from being enforced. Since the output looks correct, we can enforce the policy to meet the security

requirements to deny privileged containers. To remove the existing constraint, delete it using the manifest file with kubectl delete -f. Next, update the manifest file, remove the enforcementAction: dryrun line, and redeploy the constraint.

Evermore Industries Solution - Adding Host Constraints

Meets requirements: 3 and 5

The fifth requirement from Evermore is to deny hostPID, hostNetwork, and hostIPC in all namespaces except kube-system. We also need to test the policy before implementation, as stated in the requirements. To meet the set requirements, we need to implement two new policies. The first, k8spsphostnamespace, will block access to host namespaces, including hostPID and hostIPC. Then, to block hostNetwork, we need to implement the k8spsphostnetworkingports policy. To block access to host namespaces from all namespaces except kube-system, you need to create a new constraint that exempts kube-system. We also need to test the constraint before it's implemented, so we need to set the enforcementAction to dryrun. An example manifest is shown below.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPHostNamespace
metadata:
  name: psp-host-namespace
spec:
  enforcementAction: dryrun
  excludedNamespaces:
    - kube-system
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

Once this constraint is enforced, any attempt by a pod to use a host namespace like hostPID will be denied at startup by the admission controller.

Setting the dryrun option will only audit the policy, without enforcing it. Once it's tested, you can remove the enforcementAction: dryrun line from the manifest and redeploy it to enforce the policy. To block host networking, we need to create another constraint that uses the k8spsphostnetworkingports policy.

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPHostNetworkingPorts
metadata:
  name: psp-host-network-ports
spec:
  enforcementAction: dryrun
  excludedNamespaces:
    - kube-system
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    hostNetwork: false

Just like the previous constraint, we have included the dryrun option to test the constraint before it is enforced. Once tested and deployed, any pod that attempts to set hostNetwork to true will be denied by the admission controller with an error stating that only hostNetwork=false is allowed.

Error creating: admission webhook "validation.gatekeeper.sh" denied the request ...

Congratulations! By deploying the last two constraints, we have met all of Evermore’s requirements.

13.10 Summary

Anthos and GCP provide portability and mobility to developers, persistent storage to accommodate workloads beyond stateless applications, and options for analytics.
Anthos aims at delivering rigour to data workloads, becoming a single layer of control, security, observability, and communication between components.
Anthos manages first-party CSI drivers on each platform, and the Anthos Ready Storage program qualifies third-party drivers from industry-leading storage partners.
Anthos isolates stateful applications from the heterogeneity of the underlying hardware and makes stateful containers more portable.
Anthos supports a wide selection of storage systems, both first and third party, meeting users where they are and allowing them to leverage their existing storage systems.
Portability versus mobility: portability refers to the ability to execute in several locations, while mobility refers to the ability to transfer a resource from one physical place to another.
Kubernetes, and in turn Anthos, supports storage using the Container Storage Interface (CSI).
BigQuery Omni can execute on GCP, GKE on AWS, or Microsoft Azure. This allows the BigQuery query engine to access the customer's data residing on other public clouds without requiring any data to be moved.
Anthos Hybrid AI enables the deployment and execution of Google-trained models on-prem, meeting data residency requirements.
Optical Character Recognition (OCR) decomposes documents into a digital format where individual components can be analyzed.
Speech-to-Text converts audio recordings into text, where Natural Language Processing (NLP) can be used to understand the semantics of the speech.

[143]

Kubernetes.io does provide an initial set of recommendations: https://kubernetes.io/docs/concepts/security/pod-security-standards/ [144]

http://cve.mitre.org/cve

14 Marketplace

This chapter covers:

The Public Google marketplace
The Private Google marketplace
Deploying a Marketplace solution
Real-world scenarios

Google Cloud Marketplace is a one-stop solution to try, buy, configure, manage, and deploy software products. Frequently, there are multiple vendors offering the same package, providing an array of options to select from for your specific use case and industry in terms of operating systems, VMs, containers, storage costs, execution environment, and SaaS services. Google Cloud offers new users an initial credit that can also be used on the Marketplace; as of January 2021, this credit is $300, but it might change in the future. In this chapter, we will discuss how Google Cloud Marketplace can be used to deploy packages automatically in different Kubernetes environments, including Anthos, GKE, and GKE on-prem. When it comes to simplifying the developer experience, marketplaces add value by making it as easy as possible for users to install components, while making use of the maintainers' and providers' opinionated, best-practice configurations.

14.1 The Google Marketplace

The GCP Marketplace website offers a single place for GCP customers to find free and paid applications, either provided by Google or by our third-party partners, who extend what we offer in the platform. Deployments can use either default configurations or be specialized for specific needs such as increased memory, more storage, or more computational power with additional vCPUs. Each package has specific instructions for getting assistance after installation. Note that the Marketplace team keeps each image updated to fix critical issues and bugs. However, it is your responsibility to update the solutions that are already deployed in your environments.

14.1.1 Public Marketplace

Currently, there are more than two thousand solutions available across GCP, including full application packages and datasets. The GCP Marketplace can be accessed by clicking the Marketplace link in the Cloud console. To select a package, you can either search for a package name or browse using the left-hand pane of the Marketplace screen, as shown in Figure 14.1. Deploying a solution from the Marketplace makes it easy to deploy new applications with a simple "point and click" operation that is the same across multiple environments, whether public clouds or on-prem (e.g., your own data center).

Figure 14.1 Accessing Marketplace via GCP console and listing packages

For the scope of this book, we are interested in applications running on Kubernetes. All the user has to do is click on the Kubernetes Apps category in the Cloud Marketplace home page (https://cloud.google.com/marketplace/browse).

Currently there are about 100 solutions available for GKE in different areas such as networking, databases, analytics, machine learning, monitoring, storage, and more, as shown at the bottom of Figure 14.1. Solutions are categorized according to the license model: open source, paid, or "Bring Your Own License" (BYOL). BYOL is a licensing model that allows enterprises to use their licenses flexibly, whether on-premises or in the cloud.

Figure 14.2 Solutions available for Anthos GKE environment.

As of January 2022, a smaller set (45 solutions) has been tested against the GKE on-prem environment. We can view the solutions for on-prem by adding the appropriate filter, "deployment-env:gke-on-prem", to the search options, as shown in Figure 14.2. As of January 2022, 78 solutions have been tested against Anthos environments, as shown in Figure 14.3.

Figure 14.3 Solutions available for Anthos GKE-on-prem environment.

While browsing the solutions in the Marketplace, you can identify third-party solutions that are compatible with Anthos by looking for the "Works with Anthos" logo attached to the listing. In Figure 14.4, you can see solutions that have a small Anthos button; this button showcases the solutions that have been certified to work with Anthos. These listings conform to the requirements of the Anthos Ready program (https://cloud.google.com/anthos/docs/resources/anthos-ready-partners), which identifies partner solutions that adhere to Google Cloud's interoperability requirements and have been validated to work with the Anthos platform in order to meet the infrastructure and application development needs of enterprise customers. To qualify, partner-provided solutions must complete, pass, and maintain integration requirements to earn the Works with Anthos badge.

Figure 14.4 Solutions available for Anthos environment.

If you click on one of the offerings that are certified for Anthos, you will see the "Works with Anthos" logo in the details screen for the selected offering, as shown in Figure 14.5.

Figure 14.5 Example solution certified "Works with Anthos"

The public Marketplace provides enterprises with a quick deployment path for a number of applications provided by various vendors, including NetApp, Aqua, JFrog, Citrix, and more. But what if you wanted to add your own solution to a marketplace for your developers? Of course, you probably wouldn't want it included in the public Marketplace, and Google has accounted for this by offering a private marketplace that we will discuss in the next section.

14.1.2 Private Catalog

Private Catalog offers marketplace capabilities to enterprises for internal use, without exposing their solutions to the rest of the world. Administrators can manage visibility of applications and deployment rights at organization, folder, and project levels. Deployment Manager can be used to define preset configurations such as deployment regions, types of servers used for deployment, deployment rights, and other parameters according to enterprise policy. You can access Private Catalog via the Cloud console navigation menu under Tools (see Figure 14.6). From there it is possible to create new private catalogs, add applications, and configure access rights. Each Private Catalog should be hosted by a GCP project, and it is possible to add catalog IAM permissions at the folder and project level. Sharing a catalog with a GCP organization, folder, or project allows customers to share their solutions with their end users. The steps are very intuitive, and the interested reader can find more information online[145].

Figure 14.6 Accessing Private Catalog.

14.1.3 Deploying on a GKE on-prem cluster

If your intent is to deploy solutions from the Marketplace to an Anthos GKE on-prem cluster, then you need to define one or more namespaces on the target clusters and annotate each namespace with a secret, which will allow you to deploy the chosen solutions. There are a number of steps required:

1. If your cluster runs Istio, any external connections to third-party services are blocked by default, so it's important to configure Istio egress traffic to allow connections to the external OS package repository (see Chapter 5, Anthos Service Mesh: security and observability at scale, and the egress sketch after these steps).
2. You need to allow the downloading of images from the Google Container Registry by creating a firewall or proxy rule which allows access to marketplace.gcr.io.
3. In your GKE on-prem cluster, you might need to create a Google Cloud service account. This can be done via Cloud Shell like this:

gcloud iam service-accounts create sa-name \
    --description="sa-description" \
    --display-name="sa-display-name"

4. You need to sign in to your Anthos GKE on-prem cluster using a token or credentials for a Kubernetes Service Account (KSA) with the Kubernetes "cluster-admin" role (roles were discussed in Chapter 4, Anthos, the computing environment built on Kubernetes). This gives you super-user access to perform any action on any resource. From the console, you can generate a new public/private key pair, downloaded to your machine, by running the command:

gcloud iam service-accounts keys create ~/key.json --iam-account

download.zip
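For step 1 of the list above, a minimal sketch of an Istio egress configuration is shown here. This is an illustrative ServiceEntry only, not the exact manifest required by the Marketplace; the host list may need to be extended for your OS package repositories.

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: allow-marketplace-gcr
spec:
  hosts:
  - "marketplace.gcr.io"
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS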

Once downloaded, unzip the file, which will extract the certificate into the certs/lin folder. If your environment requires a proxy server to access the internet, you can configure the proxyUrl section. This configuration parameter is used only by the gkeadm command during the VM deployment.

proxyUrl: "https://my-proxy.example.local:3128"

When a proxy is configured, you will also need to add the appropriate addresses to the OS or system no_proxy variable. This configuration is specific to each company and deployment; a full explanation of how proxy servers work is beyond the scope of this book. As a starting point, you may need to add your vCenter server, local registry (if configured), and the CIDR range of the ESX hosts. The last section is the only one that comes partly prepopulated during configuration file generation:

dataDiskName
dataDiskMB
Name of the VM
Amount of CPUs
Amount of memory in MB
Base disk size in GB

NOTE:

The dataDisk folder where the new disk is created must exist; as a result, you must create it manually up front. The admin workstation can be assigned an IP address using a static IP assignment or by a DHCP server. Your implementation choice is defined in the network subsection of the configuration file using the ipAllocationMode property. For DHCP use cases, ipAllocationMode must be set to "dhcp", and all other child network configuration elements remain undefined. When using a static IP assignment, the ipAllocationMode property must be set to "static", followed by the IP, gateway, netmask, and DNS configuration. The DNS value can be defined as an array with multiple values. Finally, set the NTP server used by the admin workstation. It's mandatory to use an NTP server that is in sync with the vSphere infrastructure; otherwise, time differences will cause deployment failures. Two example configuration files are shown below: the first has been configured to use a static IP, and the second has been configured to use DHCP.

adminWorkstation:
  name: "gke-admin-ws-200617-113711"
  cpus: 4
  memoryMB: 8192
  diskGB: 50
  dataDiskName: "gke-on-prem-admin-workstation-data-disk/gke-admin-ws-data-d"
  dataDiskMB: 512
  network:
    ipAllocationMode: "static"
    hostConfig:
      ip: "10.20.20.10"
      gateway: "10.20.20.1"
      netmask: "255.255.255.0"
      dns:
      - "172.16.255.1"
      - "172.16.255.2"
  proxyUrl: "https://my-proxy.example.local:3128"
  ntpServer: "myntp.server.local"

adminWorkstation:
  name: "gke-admin-ws-200617-113711"
  cpus: 4
  memoryMB: 8192
  diskGB: 50
  dataDiskName: "gke-on-prem-admin-workstation-data-disk/gke-admin-ws-data-di"
  dataDiskMB: 512
  network:
    ipAllocationMode: "dhcp"
    hostConfig:
      ip: ""
      gateway: ""
      netmask: ""
      dns:
  proxyUrl: "https://my-proxy.example.local:3128"
  ntpServer: "myntp.server.local"

Now we can create the admin workstation on our vSphere infrastructure using the gkeadm utility.

./gkeadm create admin-workstation --auto-create-service-accounts

Adding the --auto-create-service-accounts flag allows you to automatically create the associated service accounts in your project. Once the admin workstation has been created, you are ready to deploy the admin cluster. In the next section, we will go through the steps to create your admin cluster.

Creating an Admin Cluster

The admin cluster is the key component of the Anthos control plane. It is responsible for the supervision of Anthos on VMware implementations, and the provisioning and management of user clusters. It's deployed as a Kubernetes cluster using a single control plane node and two worker nodes, as shown in Figure C.4. The control plane node provides the Kubernetes API server for the admin control plane, the admin cluster scheduler, the etcd database, the audit proxy, and any integrated load-balancer pods. The worker nodes provide resources for Kubernetes add-ons like kube-dns, Cloud Monitoring (formerly Stackdriver), or vSphere pods. In addition to the admin control plane and add-ons, the admin cluster hosts the user cluster control planes. As a result, the user clusters' API servers, schedulers, etcd, etcd maintenance, and monitoring pods are all hosted on the admin cluster.

Figure C.4 Anthos on VMware admin cluster architecture

To create an admin cluster, you will need to SSH into the admin workstation that was created in the last section. SSH into the admin workstation using the key that was created when you deployed the admin workstation, located at .ssh/gke-admin-workstation.

ssh -i /usr/local/google/home/me/.ssh/gke-admin-workstation ubuntu@{admin-workstation-IP}

Similar to the admin workstation creation process, the admin cluster uses a YAML file that is divided into sections. The vCenter, gkeconnect, stackdriver and gcrkeypath sections are pre populated with values gathered from the admin workstation YAML[197] file, while all other sections must be filed in

for your deployment before creating the cluster. You can use the included admin cluster configuration file, or you can generate a new one using the gkectl tool that is already installed on the admin workstation. Unlike the pre-created template file, any template generated manually with gkectl will not contain prepopulated values. To create a new file, use the gkectl create-config admin option:

gkectl create-config admin --config={{ OUTPUT_FILENAME }}

Both creation methods produce the same sections. The first two sections of the configuration file must remain unchanged; they define the API version and cluster type.

apiVersion: v1
kind: AdminCluster

The next section is dedicated to the vSphere configuration, containing the requirements for Virtual Machine and disk placement.

TIP:

A good practice is to always use a fully qualified domain name for vCenter and to avoid using IP addresses in production environments.

vCenter:
  address: "FullyQualifiedDomainName or IP address of vCenter server"
  datacenter: "vCenter Datacenter Name"
  cluster: "vCenter Cluster name"
  resourcePool: "vCenter Resource Pool name"
  datastore: "vCenter Datastore Name for GKE VM placement"
  folder: "Optional: vCenter VM Folder"
  caCertPath: "vCenter public certificate file"
  credentials:
    fileRef:
      path: "path to credentials file"
      entry: "Name of entry in credentials file referring to username and password"
  # Provide the name for the persistent disk to be used by the deployment (ending
  # in .vmdk). Any directory in the supplied path must be created before deployment.
  dataDisk: "Path to GKE data disk"

TIP:

A good practice is to place the data disk in a folder; the dataDisk property must point to a folder that already exists. Anthos on VMware creates a virtual machine disk (VMDK) to hold the Kubernetes object data for the admin cluster. The installer creates the VMDK for you, so make sure the name is unique.

TIP:

If you prefer not to use a resource pool and want to place admin cluster resources directly at the cluster level, provide "/Resources" in the resourcePool configuration.

In the next section we define the IP settings for admin cluster nodes, services, and pods. These settings are also used for deploying the user cluster master nodes. First, we must define whether the Anthos on VMware admin cluster nodes and user cluster master nodes will use DHCP or static IP assignments. If the static option is used, an additional YAML file must be created for IP address assignments; this file is specified in the ipBlockFilePath property. The next two properties are dedicated to the Kubernetes service and pod CIDR ranges, which are detailed in Table C.1 below. They are used by Kubernetes pods and services and are described in detail in the chapter Computing environment built on Kubernetes. The assigned network ranges must not overlap with each other or with any external services consumed by the management plane, for example, any internet proxy used for communication with GCP.

TIP:

Because Anthos on VMware operates in island mode, the IP addresses used for Pods and Services are not routable into the datacenter network. That means you can reuse the same ranges for every new cluster.

Finally, the last section defines the target vSphere network name that the

Kubernetes nodes will use once provisioned.

Table C.1 Admin Cluster Properties

Property key       Property description

network            Parent key for ipMode, serviceCIDR, podCIDR and vCenter

ipMode             Parent key for type and ipBlockFilePath

type               IP mode to use ("dhcp" or "static")

ipBlockFilePath    Path to the YAML configuration file used for static IP assignment. Must be used in conjunction with the type: static key-value pair

serviceCIDR        Kubernetes service CIDR used for control plane deployed services. Minimum size: 128 addresses

podCIDR            Kubernetes pod CIDR used for control plane deployed services. Minimum size: 2048 addresses

vCenter            Parent key for networkName

networkName        vSphere port group name that admin cluster nodes and user cluster master nodes are attached to

An example configuration is shown below.

network:
  ipMode:
    type: dhcp
  serviceCIDR: 10.96.232.0/24
  podCIDR: 192.168.0.0/16
  vCenter:
    networkName: "My Anthos on VMware admin network"
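As a quick back-of-the-envelope check against the minimum sizes listed in Table C.1 (the comments below are illustrative only, not additional configuration):

# serviceCIDR 10.96.232.0/24 -> 2^(32-24) = 256 addresses    (>= the 128-address minimum)
# podCIDR 192.168.0.0/16     -> 2^(32-16) = 65,536 addresses (>= the 2,048-address minimum)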

As we mentioned, node IP assignments can be configured via a static configuration file. The path to that file is specified under the ipBlockFilePath key, which must be uncommented and is only taken into account when the ipMode.type key is set to static. Additionally, DNS and NTP servers must be specified and a search domain defined, as shown in the example below.

network:
  ipMode:
    type: "static"
    ipBlockFilePath: "myAdminNodeHostConfFile.yaml"
  hostConfig:
    dnsServers:
    - "8.8.8.8"
    ntpServers:
    - "myNTPServer"
    searchDomainsForDNS:
    - "myDomain.local"

The static host configuration file is built using two main configuration keys: hostconfig and blocks. The hostconfig key defines information about DNS servers, NTP servers, and search domains. The blocks key defines the netmask and gateway for the Kubernetes nodes, followed by an array of hostnames and their corresponding IP addresses.

Property key    Property description

blocks          Parent key for netmask, gateway and ips

netmask         Network netmask

gateway         Network gateway

ips             Array of ip and hostname keys with corresponding values

blocks:
- netmask: 255.255.255.128
  gateway: 10.20.0.1
  ips:
  - ip: 10.20.0.11
    hostname: admin-host1
  - ip: 10.20.0.12
    hostname: admin-host2
  - ip: 10.20.0.13
    hostname: admin-host3
  - ip: 10.20.0.14
    hostname: admin-host4
  - ip: 10.20.0.15
    hostname: admin-host5

TIP:

IP addresses are not assigned to nodes in the order defined in the file; they are randomly picked from the pool of available IPs during resizing and upgrade operations.

The next section of the configuration is for cluster load balancing. Anthos on VMware requires a load balancer to provide a Virtual IP (VIP) for the Kubernetes API server. For your clusters you can choose between the bundled load balancer based on MetalLB, an integrated F5 BIG-IP, or any other load balancer using a manual configuration. MetalLB is becoming a popular solution for bare-metal-style implementations, including VMware[198], outside of the hyperscalers' built-in solutions. Enabling MetalLB on the admin cluster only requires defining kind: MetalLB in the admin cluster configuration file, as presented below.

loadBalancer:
  vips:
    controlPlaneVIP: "133.23.22.100"
  kind: MetalLB

We will explain the options in greater detail in the Load Balancers section of this chapter.

To make sure that Kubernetes control plane nodes are distributed across different ESXi hosts, Anthos supports vSphere anti-affinity groups. This guarantees that a physical ESXi host failure will only impact a single Kubernetes node or add-on node, providing a production-grade control plane configuration. Set this value to true to leverage anti-affinity rules, or false to disable them.

antiAffinityGroups:
  enabled: true/false

You can monitor the cluster using Google Cloud Logging and Monitoring by setting the appropriate values in the stackdriver section of the configuration file. Logs and metrics can be sent to a dedicated GCP project or to the same project where the cluster is being created. You will need to supply the projectID that you want to use for the logs, the cluster location, VPC options, the service account key file with the appropriate permissions on the project, and your decision to enable or disable vSphere metrics.

stackdriver:
  projectID: "my-logs-project"
  clusterLocation: "us-central1"
  enableVPC: false
  serviceAccountKeyPath: "my-key-folder/log-mon-key.json"
  disableVsphereResourceMetrics: true

Moreover, you can also integrate audit logs from the cluster's API server with Cloud Audit Logs. You must specify the project that the integration should target (it can be the same project used for your Cloud Operations integration), the cluster location, and a service account key with the appropriate permissions.

cloudAuditLogging:
  projectID: "my-audit-project"
  clusterLocation: "us-central1"
  serviceAccountKeyPath: "my-key-folder/audit-log-key.json"

It's important to make sure that problems on Kubernetes nodes are detected and fixed quickly, so, similar to a GKE cluster, Anthos on VMware uses a Node Problem Detector. The detector watches for possible node problems and reports them as events and conditions. When the kubelet becomes unhealthy or ContainerRuntimeUnhealthy conditions are reported for the kubelet or docker systemd service, the auto repair functionality tries to restart them automatically. The auto repair functionality of Anthos on VMware clusters also automatically recreates Kubernetes node VMs when they are deleted by mistake, and recreates unresponsive or faulty Virtual Machines. It can be enabled or disabled in the cluster deployment configuration file by setting the autoRepair enabled option to either true or false.

autoRepair:
  enabled: true/false

When enabled, a cluster-health-controller deployment is created on the corresponding cluster in the kube-system namespace. If a node is flagged as unhealthy, it is drained and recreated, as shown in Figure C.5. Figure C.5 Node auto repair process
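Assuming the deployment name mentioned above, you can quickly confirm that the controller is present with kubectl; this is a hedged example, and object names can differ between versions:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get deployment cluster-health-controller -n kube-system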

NOTE:

To disable the auto repair functionality on the admin cluster, the cluster-health-controller deployment must be deleted from the admin cluster.

It is possible to deploy Anthos on VMware from a private Docker registry instead of gcr.io. To configure your deployment to use a private registry, you need to set the values in the privateRegistry section of the configuration file.

You will need to supply values for the registry address, the CA certificate for the registry, and a reference to the credentials entry to use in the credentials file.

privateRegistry:
  address: "{{Private_Registry_IP_address}}"
  credentials:
    fileRef:
      path: "{{my-config-folder}}/admin-creds.yaml"
      entry: "private-registry-creds"
  caCertPath: "my-cert-folder/registry-ca.crt"
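Once every section is filled in, the admin cluster itself is validated and created from the admin workstation with gkectl. The commands below are only a sketch; the file name is a placeholder, and flags can differ between Anthos on VMware versions, so check the documentation for your release.

# Validate the configuration file and the target vSphere/GCP environment
gkectl check-config --config admin-cluster.yaml

# Upload the node OS images to vSphere (only needed once per version)
gkectl prepare --config admin-cluster.yaml

# Create the admin cluster
gkectl create admin --config admin-cluster.yaml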

That completes the admin cluster configuration file; now let's move on to the user cluster configuration.

Security is very important for any Anthos-based Kubernetes implementation. Anthos on VMware introduced a secret encryption capability to ensure secrets are encrypted at rest without requiring an external Key Management Service. As a result, before a secret is stored in the etcd database, it is encrypted. To enable or disable that functionality, edit the secretsEncryption section of the configuration file.

secretsEncryption:
  mode: GeneratedKey
  generatedKey:
    keyVersion: 1

TIP:

Any time the key version is updated, a new key is generated and secrets are re-encrypted using that new key. You can enforce key rotation with the gkectl update command; as a result, all existing and new secrets are encrypted with the new key and the old one is securely removed.

User Cluster Creation

Each new user cluster must be connected to an admin cluster; in fact, there is no way to create a workload cluster without an admin cluster. A single admin cluster can manage multiple user clusters, but a single user cluster can be supervised by only one admin cluster.

Each provisioned user cluster can be deployed in one of two configurations: with or without a highly available (HA), production-grade management plane. As presented in the drawing below, clusters with HA enabled are built with three master nodes (User Cluster #2 in the drawing), and clusters without HA use a single node (User Cluster #1 in the drawing). The single-node management plane consumes less compute resources, but if that node or its physical host fails, the ability to manage the cluster is lost[199]. In HA mode, a single master node failure does not impact the ability to manage the Kubernetes cluster configuration. Figure C.6 Anthos on VMware user clusters architecture

IMPORTANT:

After deployment, the number of master nodes cannot be changed without a full cluster re-creation.

Every new user cluster is placed in a dedicated namespace in the admin cluster, which is used to host the services, deployments, pods, and ReplicaSets needed to manage it. The namespace name matches the cluster name, so you can easily see all the details by referring to it via kubectl get all -n {{ clusterName }}. Each user cluster namespace is hosted on dedicated nodes that are added to the admin cluster when you create the user cluster. The nodes are labeled with the cluster name, and when the cluster management pods are created, they use a node selector to force their placement on the dedicated user cluster nodes. Other, non-system namespaces are created on the workload cluster nodes, as presented in the picture below. Figure C.7 Anthos on VMware user cluster namespaces
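For example, assuming a user cluster named my-user-cluster (the names here are hypothetical), you can list the management objects the admin cluster hosts for it:

kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get namespaces
kubectl --kubeconfig ADMIN_CLUSTER_KUBECONFIG get all -n my-user-cluster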

Similar to an admin cluster, a user cluster deployment is based on a YAML configuration file. The first two sections of the configuration file must remain unchanged; they define the API version and cluster type.

apiVersion: v1
kind: UserCluster

TIP: You can convert old versions of configuration files using

gkectl create-config cluster --config $MyAwsomeClusterConfigFile.yaml --from

The next section is where you provide the name of the new cluster and the Anthos on VMware version. The cluster name must be unique within a GCP project, and the version must match the admin cluster version.

name: "MyAwesomeOnPremCluster"
gkeOnPremVersion: 1.10.0-gke.194

The next section is optional. It is used to manage vSphere integration and worker node placement. It is strongly recommended to separate admin and workload compute resources at the vSphere level to ensure Kubernetes management plane availability; this guarantees resources for each Anthos on VMware cluster in case of resource saturation or limited access to vSphere. Following that practice, your user cluster worker nodes will be placed in the resource pool and datastore[200] defined under the vCenter section. Additionally, user clusters can be deployed into separate VMware datacenters if required. To make sure permissions are properly separated for vSphere resources, it is recommended to use a dedicated account for user cluster communication with vCenter.

vCenter:
  datacenter: "MyWorkloadDatacenter"
  resourcePool: "GKE-on-prem-User-workers"
  datastore: "DatastoreName"
  credentials:
    fileRef:
      path: "path to credentials file"
      entry: "Name of entry in credentials file referring to username and password"

TIP:

You can decide to use only a single property, such as vCenter.resourcePool. In that case, comment out the other lines by adding # at the beginning of each line; the configuration of any commented property will be inherited from the admin cluster configuration.

The networking section has the same structure as the one described for the admin cluster, extended with the capability to define additional network interfaces that can be used by Kubernetes workloads.

network:
  ipMode:
    type: dhcp
  serviceCIDR: 10.96.232.0/24
  podCIDR: 192.168.0.0/16
  vCenter:
    networkName: "My Anthos on VMware user network"
  additionalNodeInterfaces:
  - networkName: "My additional network"
    type: dhcp

Or, in the case of static IP assignment:

network:
  ipMode:
    type: "static"
    ipBlockFilePath: "myNodeHostConfFile.yaml"
  additionalNodeInterfaces:
  - networkName: "My additional network"
    type: "static"
    ipBlockFilePath: "mySecondNodeHostConfFile.yaml"

Earlier in this chapter we stated that the management plane of a user cluster can be HA protected or not. That decision is configured under the masterNode.replicas key of the cluster configuration file by defining either 3 or 1 replicas. Under the same section we can also scale up the CPU and memory of the master nodes if required, or enable the auto-resize capability.

masterNode:
  cpus: 4
  memoryMB: 8192
  replicas: 3

masterNode:
  autoResize:
    enabled: true

TIP:

The configuration file is key: value based. Values in quotation marks "" are interpreted as strings, and values without quotation marks as integers. All number-based configuration elements, such as the number of replicas, CPUs, or memory, must be specified as integers.

User cluster worker nodes are defined as pools of nodes. This allows you to have different sizes of nodes in the same Kubernetes cluster, with labels and taints applied to the node objects. Finally, the last configuration element of each defined node pool is the node operating system, offering either Google's hardened Ubuntu image or Google's immutable Container-Optimized OS

(COS). If you use the bundled load balancer type, MetalLB, at least one of the pools must have the enableLoadBalancer configuration set to true.

nodePools:
- name: "My-1st-node-pool"
  cpus: 4
  memoryMB: 8192
  replicas: 3
  bootDiskSizeGB: 40
  labels:
    environment: "production"
    tier: "cache"
  taints:
  - key: "staging"
    value: "true"
    effect: "NoSchedule"
  vsphere:
    datastore: "my-datastore"
    tags:
    - category: "purpose"
      name: "testing"
  osImageType: "cos"
  enableLoadBalancer: false

NOTE:

Worker node virtual machines are named after the defined node pool name followed by random numbers and letters, for example My-1st-node-pool-sxA7hs7. During cluster creation, anti-affinity groups are created on vSphere with the worker nodes placed inside them. This vSphere functionality distributes the worker node VMs across different vSphere hosts in a cluster, avoiding placing too many nodes on the same physical host. As a result, in the case of a VMware ESXi host failure, only a limited number of Kubernetes nodes are impacted, decreasing the impact on hosted services. To enable antiAffinityGroups, it's mandatory to have at least 3 ESXi hosts in the vSphere cluster. You can enable or disable this feature under the antiAffinityGroups.enabled section of the configuration file by changing the default value to either true or false.

antiAffinityGroups:
  enabled: true
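To illustrate how the labels and taints defined in the node pool example earlier in this section are consumed by workloads, here is a hypothetical Deployment that pins itself to those nodes. It is a sketch only and not part of the Anthos configuration files; the names and image are placeholders.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      nodeSelector:
        tier: "cache"          # matches the node pool label defined above
      tolerations:
      - key: "staging"         # tolerates the node pool taint defined above
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: cache
        image: redis:6         # hypothetical workload image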

By default, all workload clusters can be accessed using auto-generated kubeconfig files. In that case no additional configuration is required, but the access scope is not limited and is hard to manage. To solve this problem, Anthos on VMware clusters can integrate with external identity providers via OpenID Connect (OIDC) and grant access to namespaces or clusters using Kubernetes authorization (Figure C.8). Figure C.8 Kubernetes cluster and namespace based access

You can integrate the cluster with an existing Active Directory Federation Services (ADFS), Google, Okta, or any other certified OpenID provider[201]. To configure these settings, we must provide all the provider-specific information to leverage Anthos Identity Service and edit the ClientConfig resource after cluster creation[202].
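As a hedged pointer to where that configuration lives, Google's documentation describes editing the ClientConfig custom resource on the user cluster after it has been created; the resource name and namespace below follow that pattern and should be verified against your version:

kubectl --kubeconfig USER_CLUSTER_KUBECONFIG edit clientconfig default -n kube-public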

TIP: To restore the user cluster kubeconfig, you can run:

kubectl --kubeconfig $ADMIN_CLUSTER_KUBECONFIG get secrets -n $USER_CLUSTER_

The next option, similar to the admin cluster, enables or disables the auto repair functionality at the configuration file level.

autoRepair:
  enabled: true/false

The key difference from the admin cluster configuration is that here it can be easily changed by editing the YAML configuration file and triggering a gkectl update command.

The last part to cover before we move on to networking is storage. By default, Anthos on VMware includes the vSphere Kubernetes volume plugin, which allows dynamic provisioning of vSphere VMDK disks on top of datastores[203] attached to vCenter clusters. After a new user cluster is created, it is configured with a default storage class that points to the vSphere datastore. Besides that volume connector, newly deployed clusters automatically get the vSphere Container Storage Interface (CSI) driver. CSI is a standard API that allows you to connect directly to compatible storage, bypassing vSphere storage. It's worth mentioning that Anthos on VMware clusters still support the in-tree vSphere Cloud Provider volume plugin, which also enables a direct connection to storage, bypassing vSphere storage. However, due to known limitations, like the lack of dynamic provisioning support, it's not recommended to use the in-tree plugin; you should use the CSI driver instead.

We have now defined the compute and storage components used for an Anthos on VMware deployment, so let's summarize. Our build is based on the admin workstation, the admin cluster, and the user clusters, all deployed and hosted on a vSphere environment. The picture below presents the resource separation, based on resource pools dedicated to every user cluster and the combined admin cluster resources with the user cluster master nodes. Figure C.9 Anthos on VMware resource distribution
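To tie the pieces together, the user cluster itself is created (and later updated) from the admin workstation with gkectl, using the configuration file we have been building. The commands below are a sketch; the file names are placeholders and flag names can vary between Anthos on VMware versions.

# Validate the user cluster configuration against the admin cluster
gkectl check-config --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config user-cluster.yaml

# Create the user cluster
gkectl create cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config user-cluster.yaml

# Apply later changes (for example, toggling autoRepair) by editing the file and running
gkectl update cluster --kubeconfig ADMIN_CLUSTER_KUBECONFIG --config user-cluster.yaml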

You have learned a lot about the compute part of an Anthos on VMware implementation. In the next section, we will go through the communication capabilities, requirements, and limitations that depend on the network implementation choices made.

C.2.1 Anthos Networking

To understand the role that networking plays in an Anthos cluster, we need to understand that an Anthos cluster consists of two different networking models. The first is the vSphere network where the entire infrastructure is placed, and the other is Kubernetes networking. At the beginning of this chapter we stated that Anthos on VMware does not require any Software Defined Networking applied on top of the vSphere

infrastructure and can be fully VLAN based.

IP Management

Now that we are familiar with the deployment flow, let's go deeper into the configuration and architecture elements. Besides the configuration files mentioned previously, additional files can be required depending on the deployment scenario. When DHCP is used, all nodes have their IP addresses assigned from it, as presented in Figure C.10. Figure C.10 DHCP based deployment

If the deployment does not utilize DHCP services for node IP allocation, additional host configuration files must be created for the admin cluster and each user cluster (Figure C.11). Some organizations consider statically assigned addresses to be the most stable implementation, since it removes DHCP problems and ensures that lease expiration will not disturb node communication. However, while it may eliminate DHCP concerns, it introduces management overhead for preparing the host configuration files and limits the scalability of the created clusters. The best implementation is something that you must decide for your cluster and organization. Figure C.11 Static IP assignment scenario

It's possible to follow a mixed deployment scenario where both DHCP-based and non-DHCP-based clusters are deployed, as presented in the picture below. Figure C.12 Mixed DHCP and Static IP assignment scenario

With this configuration, we can have an admin cluster using static IP addresses for its management plane and the user cluster Kubernetes master nodes, DHCP for the first user cluster's worker nodes, and static IP addresses for the second user cluster's worker nodes, or the complete opposite. Because an IP address change on a Kubernetes node introduces significant problems for storage access, it's recommended to use static IP assignment for admin cluster nodes, whereas DHCP can be used for short-lived user clusters. There are a few constraints related to mixed deployments:

The IP assignment method must be identical for the entire admin cluster and the user cluster master nodes, as they share the same IP address pool

All user cluster worker node pools must use the same IP assignment method for the entire cluster

Separate user clusters can use different IP assignment methods even if they are managed by the same admin cluster

So far, we have talked about the IP assignment options and the arguments for and against each implementation. Now let's talk about the detailed network configuration, good practices, and recommendations for both the management and the workload plane.

Management plane

Looking deeper into the management plane network configuration, there are two elements we need to communicate with, depending on the activity being performed. The first element we must deploy for Anthos on VMware is the admin workstation, which is fully preconfigured and hardened by Google. The second communication point is the admin cluster itself, hosting all admin nodes and user cluster master nodes. Both require communication with the VMware infrastructure to deploy Virtual Machines automatically. There is no technical requirement to separate the admin workstation, the nodes, and the vSphere infrastructure, but from a security perspective it is highly recommended to isolate those networks at layer 2, as presented in the picture below. Figure C.13 Anthos on VMware vSphere networking

Because Anthos on VMware clusters are integrated into the GCP console, they require communication with the outside world. This connection can be achieved using a direct internet connection or via an internet proxy.

The default networking model for a new Anthos cluster is known as island mode. This means that Pods are allowed to talk to each other but, by default, cannot be reached from external networks. Another important note is that outgoing traffic from Pods to services located outside the cluster is NATed behind the node IP addresses. The same applies to Services: their ranges can overlap between clusters but must not overlap with the Pod subnet (Figure C.14). Additionally, the Pod and Service subnets must not overlap with external services consumed by the cluster, such as an internet proxy or NTP[204]; otherwise traffic will not be routed outside the created cluster. Figure C.14 Pods and services

Both the service CIDR and the Pod CIDR are defined in the admin cluster configuration YAML file, and built-in preflight checks make sure that the two ranges do not overlap.

network:
  serviceCIDR: 10.96.0.0/16
  podCIDR: 10.97.0.0/16

Load Balancers

To manage a Kubernetes cluster you must be able to reach its Kubernetes API server. The admin cluster exposes it via a load balancer IP that can be configured in three flavors, depending on the type. At the time of writing this chapter, Anthos on VMware supports the following types: MetalLB, F5BigIP, and ManualLB, which replace the earlier SeeSaw option.

Bundled - MetalLB

MetalLB is a Google-developed, open source[205] network load balancer implementation for Kubernetes clusters and a Cloud Native Computing Foundation sandbox project. It runs on bare-metal style implementations, allowing the use of LoadBalancer Services inside any cluster. MetalLB addresses two requirements that hyperscaler Kubernetes implementations provide but on-premises ones lack: address allocation and external announcement. Address allocation provides the capability to automatically assign an IP address to a newly created LoadBalancer Service without having to specify it manually. Moreover, you can create multiple IP address pools that can be used in parallel depending on your needs, for example a pool of private IP addresses used to expose services internally and a pool of IP addresses that provides external access. As soon as an IP address is assigned, it must be announced on the network, which is where the external announcement feature comes into play. MetalLB can be deployed in two modes: layer 2 mode and BGP mode. The current implementation of Anthos on VMware uses layer 2 mode only. In the layer 2 implementation, external announcement is handled with standard address discovery protocols: ARP for IPv4 and NDP for IPv6. Each Kubernetes Service is presented as a dedicated MetalLB load balancer; as a result, when multiple Services are created, traffic is distributed across the load balancer nodes. Such an implementation has advantages and constraints. The key constraint is that all traffic for a Service IP always goes to one node, where kube-proxy spreads it across all of the Service's pods. As a result, Service bandwidth is always limited to the network bandwidth of a single node. In case of node failure, the Service is automatically failed over; such a process should take no longer than 10 seconds. Looking at the advantages of the MetalLB layer 2 implementation, it is a fully in-cluster implementation with no special hardware requirements on the physical network. Layer 2 implementations do not limit the number of load balancers created per network as long as there are IP addresses available to be assigned. That is a consequence of using the memberlist Go library to maintain the cluster membership list and detect member failures with a gossip-based protocol, instead of, for example, the Virtual Router Redundancy Protocol[206]. Figure C.15 Admin cluster networking with MetalLB

INFO:

The admin cluster's Kubernetes VIP does not leverage MetalLB, as the admin cluster does not implement HA. All user clusters use the MetalLB deployment to expose their Kubernetes VIPs. MetalLB is part of Anthos on VMware, covered under the Anthos license and standard support in line with the chosen support model, including lifecycle management activities for each release.

Integrated - F5

The second option for introducing load balancer capabilities is integration with the F5 BIG-IP load balancer, called integrated load balancer mode. Compared to MetalLB, the F5 infrastructure must be prepared upfront and is not deployed automatically by Google. For Anthos on VMware, the BIG-IP provides external access and L3/4 load-balancing services. When integrated load balancer mode is selected, Anthos on VMware automatically performs preflight checks and installs the single supported version of the F5 BIG-IP Container Ingress Services (CIS) controller. Figure C.16 Admin Cluster networking with F5 BIG-IP

A production license provides up to 40 Gbps of throughput for the Anthos on VMware load balancer. The BIG-IP integration is fully supported by Google in line with the support compatibility matrix, but is licensed separately under F5 licensing.

Manual Load Balancer

To allow the flexibility of using existing load balancing infrastructure of your choice, Anthos on VMware can be deployed with a manually configured load balancer. In such an implementation, you need to set up a load balancer with the Kubernetes API VIP before cluster deployment starts. The configuration steps depend on the load balancer you are using; Google provides detailed documentation describing the BIG-IP and Citrix configuration steps. It's important to note that in manual mode you cannot expose Services of type LoadBalancer to external clients. Due to the lack of automated integration in manual load balancing mode, Google does not provide support for it, and any issues encountered with the load balancer must be handled with the load balancer's vendor.

We have now covered all three modes, so let's have a look at the configuration files. The configuration lives in the dedicated loadBalancer section of the admin config YAML file and varies depending on the chosen option. For a MetalLB configuration in the admin cluster, we must define the load balancer kind as MetalLB and provide the Kubernetes API service VIP.

loadBalancer:
  vips:
    controlPlaneVIP: "203.0.113.3"
  kind: MetalLB

When the F5 BIG-IP integrated mode is chosen, the load balancer section must be changed to kind: F5BigIP. The entire MetalLB section (enabled by default in a newly generated config file) must be commented out, and the f5BigIP section must be defined with the credentials file and partition details.

loadBalancer:
  vips:
    controlPlaneVIP: "203.0.113.2"
  kind: F5BigIP
  f5BigIP:
    address: "loadbalancer-ip-or-fqdn"
    credentials:
      fileRef:
        path: "name-of-credential-file.yaml"
        entry: "name of entry section in above defined file"
    partition: "partition name"
    snatPoolName: "pool-name-if-SNAT-is-used"

The last use case covers manual load balancing. In that scenario we must define kind: ManualLB, comment out the bundled load balancer section, and manually define the NodePort configuration options.

loadBalancer:
  vips:
    controlPlaneVIP: "203.0.113.2"
  kind: ManualLB
  manualLB:
    controlPlaneNodePort: "9000"
    addonsNodePort: "9001"

User Clusters

User cluster networking is based on the same principles as admin cluster networking, extended with workload deployment capabilities. Every cluster is deployed in island mode, where the Service and Pod CIDRs defined in the user cluster configuration file must not overlap.

network:
  serviceCIDR: 10.96.0.0/16
  podCIDR: 10.97.0.0/16

Again we have three modes of load balancer deployment and integration: bundled, integrated, and manual. The bundled deployment likewise uses the MetalLB implementation. You already learned that the master nodes of the user cluster are deployed into the admin cluster node network and that their IPs are assigned from a predefined static pool or from DHCP servers. The Kubernetes API service is also exposed automatically in the same network, but its IP address must be defined manually in the user cluster configuration file. User cluster worker nodes can be deployed into the same network as the admin nodes or into a separate, dedicated network. The second option is preferred, as it allows the separation of workload traffic from the management plane. Each new user cluster comes with a new dedicated MetalLB load balancer for the data plane. The control plane VIP for the Kubernetes API is always co-hosted on the admin cluster load balancer instance. The user cluster ingress VIP is automatically deployed into the network of the cluster's worker node pools. As a result, you can limit MetalLB to a dedicated node pool instead of all nodes in the user cluster. Figure C.17 User Cluster networking with MetalLB

In a multi-cluster deployment, user clusters can share a single network or use dedicated ones. When sharing a single network, make sure that the node IPs in the configuration files do not overlap. As we already mentioned, MetalLB is a fully bundled load balancer. That means that every time a Service of type LoadBalancer is created, a VIP is automatically created on the load balancer, and traffic sent to the VIP is forwarded to the Service. MetalLB provides IP address management (IPAM), so IP addresses for each Service are assigned automatically. The IP pools and the nodes that host MetalLB VIPs are defined in the load balancer section of the user cluster configuration file, as presented below.

loadBalancer:
  vips:
    controlPlaneVIP: "Kubernetes API service VIP"
  kind: MetalLB
  metalLB:
    addressPools:
    - name: "name of address pool"
      addresses:
      - "address in form of X.X.X.X/subnet or range X.X.X.X-X.X.Y.Y"
      # (Optional) Avoid using IPs ending in .0 or .255.
      avoidBuggyIPs: false
      # (Optional) Prevent IP addresses from being automatically assigned from this pool.
      manualAssign: false
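To illustrate the behavior described above, any ordinary Service of type LoadBalancer deployed to the user cluster then receives a VIP from one of these pools automatically. The Service below is a hypothetical example, not part of the Anthos configuration files.

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer   # MetalLB assigns a VIP from a configured address pool
  selector:
    app: my-app
  ports:
  - port: 80           # port exposed on the VIP
    targetPort: 8080   # container port receiving the traffic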

Additionally, we must allow the MetalLB service on one (or more) of the virtual machine pools in the worker pools section, described in detail earlier in the book, by setting enableLoadBalancer: true.

Integrated load balancing allows integration with F5 BIG-IP and provisions the LoadBalancer Service type automatically, similar to the way we described for the admin cluster integration. In that scenario, the integration point can be the same for the admin and user clusters, and there is no need to use separate F5 instances. It's important to remember that every new cluster must be pre-configured and properly prepared on the load balancer side before deployment. Figure C.18 User Cluster networking with F5 BIG-IP

Manual mode implementations for user cluster load balancers follow the same rules and introduce the same constraints and limitations as for admin clusters. Every service exposure requires contacting external teams and performing manual activities, as described in the official documentation[207].

In this section on Anthos on VMware networking, we went through the different network configuration options for management and workload clusters. It's important to remember that, similarly to IP assignment, we can decide to use a single load balancer integration mode or choose different modes for different clusters. As a result, we can have a MetalLB based management cluster, one MetalLB user cluster, a second F5 BIG-IP integrated user cluster, and a third user cluster with a manually integrated Citrix NetScaler load balancer.

loadBalancer:
  vips:
    controlPlaneVIP: "Kubernetes API service VIP"
    ingressVIP: "Ingress service VIP (must be in the node network range)"
  kind: MetalLB
  metalLB:
    addressPools:
    - name: "my-address-pool-1"
      addresses:
      - "192.0.2.0/26"
      - "192.0.2.64-192.0.2.72"
      avoidBuggyIPs: true

loadBalancer:
  vips:
    controlPlaneVIP: "Kubernetes API service VIP"
  kind: F5BigIP
  f5BigIP:
    address: "loadbalancer-ip-or-fqdn"
    credentials:
      fileRef:
        path: "name-of-credential-file.yaml"
        entry: "name of entry section in above defined file"
    partition: "partition name"
    snatPoolName: "pool-name-if-SNAT-is-used"

loadBalancer:
  vips:
    controlPlaneVIP: "Kubernetes API service VIP"
  kind: ManualLB
  manualLB:
    ingressHTTPNodePort: Ingress-port-number-for-http
    ingressHTTPSNodePort: Ingress-port-number-for-https
    controlPlaneNodePort: "NodePort-number-for-control-plane-service"

C.2.2 GCP integration capabilities

In the previous sections we talked about the compute and network architecture of Anthos on VMware. In this section we will cover the different integration capabilities for GCP services that provide a single pane of glass for all Anthos clusters, regardless of whether they are deployed on premises or on other clouds. As you already noticed during the deployment of new admin and user clusters, it is mandatory to define a GCP project that the cluster will be integrated into. That enables the Connect Agent to register the cluster and establish communication with the hub, as described in detail in the chapter Operations Management. In general, GKE Connect performs two activities: enabling connectivity and authenticating to register new clusters. Figure C.19 Fleet relationship to Connect Agents

For that purpose two dedicated service accounts are used. This section is mandatory and must be properly defined for the user cluster. Note that because the agent service account uses the Workload Identity functionality, it does not require a key file.

gkeConnect:
  projectID: "My-awesome-project"
  registerServiceAccountKeyPath: register-key.json

NOTE:

GKE Connect plays a significant role in integrating on-prem clusters into GCP. It allows us to use, directly from the GCP console, Cloud Marketplace, Cloud Run, and options to integrate with a CI/CD toolchain via Anthos authorization, without needing to expose the Kubernetes API externally.

Anthos on VMware has the ability to send infrastructure logs and metrics to GCP Cloud Monitoring. This applies to both admin and user clusters. We can choose to send only Kubernetes related metrics or include vSphere metrics as well. Metrics for each cluster can be sent to different projects.

stackdriver:
  projectID: "My-awesome-project"
  clusterLocation: gcp-region-name
  enableVPC: false/true
  serviceAccountKeyPath: monitoring-key.json
  disableVsphereResourceMetrics: true/false

Another integration feature that applies to both admin and user clusters is the capability to send Kubernetes API server audit logs. As previously described, we can choose the project and the region where the logs are stored, and the service account used for the integration.

cloudAuditLogging:
  projectID: "My-awesome-project"
  clusterLocation: gcp-region-name
  serviceAccountKeyPath: audit-key.json
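Once a cluster has been registered through gkeConnect, you can confirm from the GCP side that it shows up in your fleet. The command below is a hedged example; the gcloud command group has been renamed over time (container hub versus container fleet), so verify it against your gcloud version.

gcloud container hub memberships list --project My-awesome-project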

The last two integration features apply to user clusters only. The first is the option to use Cloud Run for Anthos from the Google Cloud console (described in detail in the chapter Anthos, the serverless compute engine (Knative)) and deploy services directly to Anthos on VMware clusters. There is not much configuration to do, as the service leverages the Connect functionality and deploys Knative in a dedicated namespace of the user cluster. That means it must be enabled on the same project in which the Anthos on VMware cluster is registered. You will learn more about Cloud Run for Anthos and Knative in the Cloud Run chapter and the related documentation[208].

The feature list closes with metering. After enabling the metering feature, the user cluster sends resource usage and consumption data to Google BigQuery. This allows us to analyze it and size our clusters in line with actual demand, or to expose and present it as reports, for example in Data Studio.

usageMetering:
  bigQueryProjectID: "My-awesome-project"
  bigQueryDatasetID: dataset-name
  bigQueryServiceAccountKeyPath: metering-key.json
  enableConsumptionMetering: false/true

C.3 Summary

Anthos on VMware is a great option for consuming cloud native capabilities on top of existing vSphere infrastructure

The architecture is composed of two elements: user clusters, which are responsible for delivering resources to hosted applications, and the admin control plane, which is responsible for the management and control of the deployed user clusters

It can be used as an add-on to co-hosted Virtual Machines as well as a dedicated on-prem Kubernetes implementation that is ready for the hybrid cloud stage of a cloud native journey

We can leverage existing VMware skills for infrastructure management, keep full visibility through Cloud Operations, and consume GCP services from your own data center if needed

The setup can be independent and self-contained using bundled features like MetalLB, or integrated into existing infrastructure, with the automation capabilities and constraints each choice brings

Cluster configurations may vary depending on purpose, size, and availability requirements

[193] https://cloud.google.com/anthos/gke/docs/on-prem/how-to/vsphere-requirements-basic#resource_requirements_for_admin_workstation_admin_cluster_and_user_clusters

[194] https://developers.google.com/identity/protocols/oauth2/openid-connect

[195] https://cloud.google.com/anthos/gke/docs/on-prem/how-to/vsphere-requirements-basic

[196] https://cloud.google.com/anthos/clusters/docs/on-prem/version-history

[197] Sections are prepopulated if the --auto-create-service-accounts flag is used.

[198] https://metallb.universe.tf/installation/clouds/

[199] The vSphere High Availability feature can mitigate that behavior and reduce Kubernetes API downtime to minutes, until the VM is restarted on a new host. vSphere HA will not protect against Virtual Machine corruption.

[200] For vSAN based deployments, all nodes must be placed on the same datastore.

[201] Anthos on VMware supports all certified OpenID providers. The full list can be found at https://openid.net/certification/

[202] https://cloud.google.com/anthos/identity/setup/per-cluster

[203] vSphere datastores can be backed by any block device, vSAN, or NFS storage.

[204] Network Time Protocol

[205] https://github.com/metallb/metallb

[206] https://tools.ietf.org/html/rfc3768

[207] https://cloud.google.com/anthos/clusters/docs/on-prem/how-to/manual-load-balance

[208] https://cloud.google.com/anthos/run/docs/install/outside-gcp/vmware

Appendix D Data and analytics

This chapter covers:

Understanding portability versus mobility
Kubernetes and storage
Anthos and storage
BigQuery Omni
Anthos Hybrid AI

Until very recently, Anthos was often referred to in the context of compute and network, and rarely, if ever, in the context of storage or data. Nevertheless, Anthos has the potential to bring to stateful workloads many of the fundamentals it introduces in terms of observability, security, configuration, and control. Furthermore, Anthos has the potential to disrupt the reachability and accessibility of data as much as it has been disrupting the portability and mobility of compute. Through Anthos, containers can seamlessly reach storage devices across hybrid and multi-cloud environments, with consistency in provisioning, configuring, allocating, and accessing these resources. Once an Anthos cluster enables stateful containers to be deployed across hybrid and multi-cloud environments, and provides those containers reach and access to data in a secure and consistent manner, the way analytics gets performed becomes ripe for a redesign. In a traditional setting, data collected from a myriad of data sources gets stored in a central location, in the form of data warehouses or data lakes, which is where analytics are executed. As a result, analytics solutions, including the ones using Artificial Intelligence (AI) and Machine Learning (ML), have long been designed and optimized assuming all the data required could be centrally reached. Recent events, however, have been challenging the ability to centralize all the data before analytics can be performed. Organizations typically have data scattered across organizational boundaries or geographic locations, be it hosted on-prem or in public clouds. Often, regulatory compliance, security or

data privacy impose some form of data residency, where data must be stored within physical or logical fences and cannot be moved outside of them. Furthermore, even if the data could cross those boundaries, bandwidth constraints, such as capacity, speed, and cost, might impose limitations on the volume of data that can be transferred to a central destination. Anthos makes it easier to bring data analytics to your data, meeting the data where it happens to be, or needs to be, either on-prem or in a public cloud.

The intersection of Anthos, data, and analytics is an emerging topic, by far the most nascent in the Anthos trajectory towards enabling modern applications. As you read this chapter, a tremendous amount of work continues across many areas related to how Anthos can enable the next generation of data services and how analytics can seamlessly work in hybrid and multi-cloud environments, including:

Data backup and disaster recovery, from on-prem to cloud and among clouds;

Databases, where the compute engine can be container native and the storage can be software defined;

Federated Learning, where the training of AI models gets orchestrated across several locations, processing the data where it resides, and centrally aggregating only the intermediate results of such computations; and

AI and ML pipelines that allow the industrialization of models in a production environment, leaving the experimentation phase.

As you can see from the introduction, data on Anthos opens up a number of additional workloads that require data, across various multi- and hybrid-cloud solutions; containers are no longer limited to only stateless workloads. But before diving deeper into using data on Kubernetes and Anthos, we need to understand the differences between portability and mobility, which we will discuss in the next section.

D.1 Portability Versus Mobility

When it comes to Anthos and data, the nuances distinguishing portability and mobility become fundamental to understanding the challenges behind processing data scattered across a hybrid and multi-cloud environment. Portability of applications enables Anthos to realize the "write once, run anywhere" premise, while mobility of applications extends that to "write once, run anywhere, move anywhere else". Portability matters to software developers during design time, and to administrators during operations, as it reduces the types of environment that may need to be provisioned. Mobility matters most to operators during runtime, as it gives them the ability to redistribute workloads dynamically, adjusting in real time to demand. Mobility, in many ways, targets avoiding locking in the workload to the specifics of the configuration where it has been running, and has been rooted in the adoption of open standards. Together, portability and mobility make industrial-scale operations more viable and extensible.

Nevertheless, the challenges related to the portability and mobility of data can be far more intricate than the ones related to the portability and mobility of computation. While increased portability expands the surface of hardware where stateful containers can be deployed, it may not equate directly to the mobility of stateful workloads from one location to another. In some cases, a stateful workload, let us call it SW, running in a location L1 and accessing a persistent storage PS, may not be able to move to a location L2. For example, SW may not be able to access PS when running in L2, and the data itself may not have the mobility to be transferred to a physical storage accessible from location L2. Data often exhibits an element of gravity, imposing movement challenges due to many factors, including the sheer size of the data, financial and operational costs of moving the data, bandwidth constraints, security requirements, regulatory compliance, and privacy constraints. Furthermore, in many cases, data is being created on a continuous basis, where applications need to ingest and process an ever-growing dataset, often within a time-sensitive window. In these scenarios, a one-time copy or a set of discrete data copies may not meet the application requirements.

On Portability

Portability refers to the ability to execute in several locations. In many ways, portability preceded any concerns around mobility as it related to workload placement. Before a workload could run on a location, the workload should

be compatible with the underlying architecture, and there should be enough resources available to run the workload. In order to increase portability, development environments seek to minimize the dependencies on specific infrastructure configurations and libraries. Portability can be viewed, however, as a requirement for effective mobility. A computing workload running at a location should only be moved to other locations where it can continue executing. In essence, the more locations an application can be made portable to, the more locations it can be moved to. The performance of the execution may differ among these locations, because each location may have different computing, memory, network, and storage configurations, but, nonetheless, the application can actually execute. Containers increase portability by packaging the application code together with all its library dependencies, minimizing requirements around specific library configurations that needed to be available at a location before it could run there. For stateful containers, portability demands that the way containers interact with storage devices in one location is compatible with the way containers interact with storage devices in another location.

On Mobility

In the context of computer science, mobility refers to the ability to transfer the location of a resource from one physical place to another. For a computational workload, it refers to the transfer of the execution of the workload from one set of physical compute resources, such as a server node, to another. For data, it refers to moving the content of the data from one physical storage device to another. Virtual Machines (VMs) made the ability to move computing across physical boundaries a viable approach, often in real time and with minimal, if any, service disruptions. The same mobility has not been granted when it comes to the data accessed by these VMs. Moving the data across networks can be subject not only to latency, bandwidth constraints, and the costs of reading from its location and writing the data into its new location, but also to challenges associated with changing the address of the data itself mid-flight. Applications typically bind addresses to physical locations at the beginning

of the execution, and these bindings cannot be easily changed. Furthermore, depending on the size of the data that requires moving, bandwidth and read/write throughput constraints cannot be overlooked. As size increases, data acquires more gravity and loses mobility. As a result, VMs that required access to data residing in persistent storage became bound to moving only to other physical computing resources where the VM could still have access to the storage where the data was initially stored. In essence, the computing workload moved but the data did not. Containers added a level of flexibility to mobility when they made it viable and easier to move workloads by simply starting new containers in the newly desired destinations and then terminating the old ones, as opposed to actually moving a running workload. Nevertheless, similar constraints applied to containers that accessed persistent storage, referred to as stateful containers. As long as the new location of a stateful container had access to the same persistent storage that the old container was using in its prior location, mobility was easier. An interesting aspect emerges with containers, however. Because the container restarts from scratch, new bindings between the container and the physical storage can be made. As a result, containers can also be moved to locations where replicas of the data exist, as long as the replicas are an exact mirror of each other.

D.1.1 Chapter Organization

This chapter provides a glimpse into the topic of Anthos, data, and analytics, and it is organized as follows:

Section Kubernetes and Storage reviews some important concepts on how Kubernetes interfaces with the storage layer, forming a foundation for the next sections;

Section Anthos and Storage discusses how Anthos adds value on top of Kubernetes to enhance the portability of workloads that access permanent storage;

Section BigQuery Omni Powered by Anthos explains how Anthos has enabled BigQuery, a Google Cloud data warehouse analytics engine, to perform queries on data that resides on other public cloud platforms;

Section Anthos Hybrid AI explains how Anthos has enabled Google Cloud analytics solutions to run on-prem, allowing cloud-born analytics to be deployed and executed without requiring the data to be moved to the cloud; and

Section Summary highlights the main topics covered in this chapter.

D.2 Kubernetes and Storage

One of the fundamental questions about container-based and Cloud Native Applications (CNAs) is how they support persistent data. These questions range from how they support something as simple as block or file operations within a cluster, to how they interact with transactional databases and object stores of petabyte scale. Since Anthos is a management plane on top of Kubernetes, it is paramount to first understand how Kubernetes exposes the configuration and management of block and file storage, and then understand not only what Anthos adds today on top of Kubernetes storage, but also what Anthos can add in the future.

In Kubernetes, a volume represents a unit of storage allocation, and the containers that interact directly with storage systems are referred to as stateful containers. Figure D.1, Data and Control Planes, shows the technology stack that enables stateful containers to interact with storage systems. Data resides at the persistent physical storage layer, the lowest in the stack, which includes, for example, Internet Small Computer Systems Interface (iSCSI) Logical Unit Numbers (LUNs), Network File System (NFS) shares, and cloud offerings for object, block, and file storage. On top of the physical layer, at the Operating System (OS) level, there are the block and file systems, which can be local or networked. These systems implement drivers, often operating at the OS kernel level, that allow upper layers to interact with the physical storage using either a block-level or a file-level abstraction. In their most basic form, containers interact with storage systems across two planes:

Data Plane: used by stateful containers to perform read and write operations. To a great extent, the data plane has been standardized. For example, the Portable Operating System Interface (POSIX) is the de facto standard protocol interface for reading and writing data into file and block storage.

Control Plane: used by Kubernetes volume plugins to allocate, deallocate, and manage space in the physical storage. Unlike the data plane, the control plane interface had not been standardized until very recently. In 2017, an effort across container orchestrators involving Kubernetes, Mesos, Docker, Cloud Foundry, and a few others led to the Container Storage Interface (CSI) specification. A storage vendor implements the CSI specification in a driver using its specific APIs and storage protocols. A vendor's CSI driver is then packaged and deployed in the Kubernetes cluster as Pods, typically separated into a controller deployment and a per-node deployment. Kubernetes calls the appropriate CSI APIs implemented by these drivers when orchestrating volumes for container workloads. CSI will be further explained in the following sections. Figure D.1 Data and Control Planes

Data, in the context of this chapter, refers exclusively to data that persists beyond the execution of applications. As a consequence, not only does the data demand

persistent storage, but it also needs to be addressed and accessed after the application that created or changed it has terminated, either normally or abnormally. These applications can range from very ephemeral ones, such as a camera application in a mobile phone that records a video stream and terminates, to a far more resilient database application that can, for example, form the backend of a digital banking service.

D.2.1 On the Emergence of a Standard Kubernetes volume plugins introduced an abstraction for vendors to add support for their block and file storage systems in Kubernetes, automating the provisioning, attaching and mounting of storage for containers. Furthermore, Kubernetes abstractions of Persistent Volumes (PVs), Persistent Volumes Claims (PVCs) and StorageClass objects enabled portability of storage and stateful containers. But other challenges remained. In early 2017, Kubernetes was not yet the most adopted container orchestration platform. Mesos, Docker Swarm and others had considerable market share. Each of these orchestration systems had a different way to expose volumes into their own containers. In order to address the container market, storage vendors needed to develop multiple storage plugins, one for each of the container orchestration systems, and many decided to delay development until a standard was created or a market leader emerged. At the same time, the Kubernetes volume plugin mechanism was facing several challenges. Volume plugins had to be implemented in core Kubernetes as an in-tree volume plugin. Storage vendors had to implement plugins in the programming language Go, contribute source code to the Kubernetes code repository, and release and maintain the code as part of the main Kubernetes releases. The in-tree volume plugins introduced different challenges to different communities, as outlined below: For the Kubernetes development community: it became daunting to have volume plugins embedded right into the repository, as part of the Kubernetes codebase, being compiled and implemented within the

Kubernetes binary code. The tight coupling introduced challenges, including:
Security: volume plugins had full privileges to access all Kubernetes components, and the entire source tree became exposed to the exploitation of security vulnerabilities;
Stability: bugs in volume plugins could affect critical Kubernetes components, causing component failure or failure of the entire Kubernetes system;
Quality and testing: testing and maintenance of external code became intertwined with CI/CD pipelines, and it became increasingly challenging to test Kubernetes code against all supported storage configurations before each release. Kubernetes developers often lacked access to laboratories that housed all the flavors of storage and the associated expertise. For some plugins, Kubernetes had to depend on users to find and report issues before they could be addressed.
For storage vendors: in-tree volume plugins slowed down, or even inhibited, the development and support of drivers due to challenges across several stages of the plugin lifecycle, including:
During development: storage vendors were required to effectively extend Kubernetes code and check it into the Kubernetes core repository, which meant acquiring expertise in both the Go programming language and Kubernetes. Forcing the use of a single programming language, Go, further limited the available talent pool: because Go was still an emerging language, fewer developers were available in the market than for languages like Java or C;
During release: storage vendors were required to align with the Kubernetes release cycle, limiting them to releasing new code and updates only when Kubernetes released code, four times a year. This tightly coupled release cycle severely impacted deployment velocity;
During commercialization: the code for volume plugins was forced to be open sourced;
For users: these dynamics affected the overall Kubernetes experience for stateful containers and prevented adoption at scale because:
Weak ecosystem: as storage vendors struggled to develop plugins,

the number of storage choices was limited;
Slow velocity: release cycles were long and testing quality was poor, so users had to wait long periods for issues to be addressed.
The Kubernetes community, supported by strong Google leadership, decided to move away from the in-tree plugin model and create a standalone approach for vendors to develop volume plugins. Kubernetes was not alone. All container orchestration systems faced similar challenges -- they needed to expose containers to as many different storage vendor products as possible, with as little work as possible, in the most secure manner. At the same time, storage vendors wanted to expose their products to as many users as possible, regardless of which container orchestration system they were using, leveraging plugins across as many systems as possible with little or no modification to the code. As a result, the Kubernetes community formed a coalition with other container orchestration communities and started working on a specification to create an extensible volume layer. The initial design focused entirely on the user experience and standard storage platform capabilities across vendors. In addition to creating a standard, Kubernetes also had a unique need to provision storage on demand, through dynamic storage provisioning. At runtime, applications should be able to request storage at any time, and Kubernetes should be able to reach out to the backend system to obtain the storage needed and make it available to the application. No administrators should need to get involved, no disruption to the execution of the application should occur, and no exclusive pre-allocation of storage per application should be necessary. Once the application no longer needs the storage, it should be released back to the pool for re-allocation.
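To make dynamic provisioning concrete, the following is a minimal sketch (not taken from the Anthos documentation) that uses the official Kubernetes Python client to create a StorageClass backed by a CSI provisioner and a PersistentVolumeClaim against it. The class name, claim name, and sizes are illustrative assumptions, and the provisioner shown is the GCE Persistent Disk CSI driver; substitute the driver appropriate for your platform.

```python
# Minimal sketch: dynamic provisioning with a CSI-backed StorageClass and a PVC.
# Assumes a reachable cluster and a local kubeconfig; names are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

# A StorageClass that delegates provisioning to a CSI driver.
storage_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="fast-ssd"),
    provisioner="pd.csi.storage.gke.io",        # GCE PD CSI driver (example)
    parameters={"type": "pd-ssd"},
    volume_binding_mode="WaitForFirstConsumer",
)
client.StorageV1Api().create_storage_class(body=storage_class)

# A claim against that class; the volume is created on demand when a pod uses it.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="data-claim"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="fast-ssd",
        resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

When the claim is eventually deleted, the dynamically provisioned volume can be released back to the pool according to the class's reclaim policy, which is the behavior described above.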

D.2.2 Container Storage Interface The Container Storage Interface (CSI) is an open source standard for exposing arbitrary block and file storage systems to containerized workloads, managed and orchestrated by container orchestration systems, like Kubernetes. The CSI specification is just that, a specification. The specification does not define how the plugin should be packaged, it does not demand any deployment mechanism, and it does not impose or include any

information on how it should be monitored. At its core, CSI provides an interface to execute the following operations:
Create/delete volumes
Attach/detach a volume to/from a node
Mount/unmount a volume on a node for a workload
Snapshot a volume and create a new volume from a snapshot
Clone a volume
Resize a volume
The interfaces in CSI are implemented as three sets of APIs, listed below, and all APIs are idempotent, meaning that they can be invoked multiple times by the same client and they will yield the same results. These APIs all require a volume identifier, which allows the receiver of the call to know that the call has been made before. Idempotency in the API brings predictability and robustness, allowing the system to handle recovery from failures, especially in the communication between components.
Identity services: operations that give basic information about the plugin, such as its name, basic capabilities, and its current state.
Node services: volume operations that need to run on the node where the volume will be used, such as mount and unmount.
Controller services: volume operations that can execute on any node, such as volume attach and detach, and volume creation and deletion.
gRPC (gRPC Remote Procedure Calls) was selected as the wire protocol because it is language agnostic, has a large community of users and has a broad set of tools to help implementation. The APIs are synchronous, meaning that the caller blocks after making the call, waiting until the results come back. The specification does not impose any packaging and deployment requirements. The only requirement is to provide gRPC endpoints over Unix sockets.
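The snippet below is a conceptual illustration of the idempotency requirement, not a real CSI driver: a CreateVolume-style handler keyed by volume name returns the same result when retried, and deleting an already-deleted volume succeeds quietly. A production driver would implement the gRPC services generated from the CSI protobuf definitions rather than an in-memory dictionary; all names here are illustrative.

```python
# Conceptual sketch of CSI-style idempotency (not a real driver):
# repeated calls with the same name yield the same volume, and deletes
# of unknown volumes succeed, so retries after failures are safe.
import uuid

class ToyControllerService:
    def __init__(self):
        self._volumes = {}  # name -> volume record

    def create_volume(self, name: str, capacity_bytes: int) -> dict:
        existing = self._volumes.get(name)
        if existing is not None:
            # Same request repeated: return the existing volume unchanged.
            return existing
        volume = {"volume_id": str(uuid.uuid4()), "name": name,
                  "capacity_bytes": capacity_bytes}
        self._volumes[name] = volume
        return volume

    def delete_volume(self, volume_id: str) -> None:
        # Deleting a volume that no longer exists is not an error.
        self._volumes = {n: v for n, v in self._volumes.items()
                         if v["volume_id"] != volume_id}

svc = ToyControllerService()
first = svc.create_volume("pvc-1234", 10 * 2**30)
retry = svc.create_volume("pvc-1234", 10 * 2**30)
assert first["volume_id"] == retry["volume_id"]  # idempotent create
svc.delete_volume(first["volume_id"])
svc.delete_volume(first["volume_id"])            # idempotent delete
```

This is exactly the property that lets an orchestrator safely retry a call after a timeout or a component restart without risking duplicate volumes or spurious errors.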

D.2.3 Differentiation Behind CSI The CSI standard emerged in an effort to address several goals, and has truly

become the foundation for creating a very extensible volume layer. CSI has created a storage ecosystem for Kubernetes in which not only do all major storage vendors support it, but end users can also write their own custom CSI drivers specific to their applications. In essence, CSI:
Standardizes the storage control plane: fostering interoperability between storage vendors and container orchestrators. Storage vendors build a single driver that can be used by any orchestrator, and container orchestrators can interface with any driver;
Supports dynamic storage allocation: eliminating the need for pre-provisioning;
Accelerates the expansion of the storage vendor ecosystem: increasing the number of options supported by Kubernetes; and
Provides an out-of-tree design: lowering the barrier for developers to implement volume drivers, eliminating risks associated with source code contributions, and decoupling Kubernetes release cycles from storage vendors' development cycles.
The next section discusses how Anthos makes stateful containers more portable through Anthos Ready Storage and adherence to CSI.

D.3 Anthos and Storage The abstraction of storage as a set of persistent volumes of different sizes, allocated and de-allocated dynamically, on demand, constitutes a core value of what Kubernetes delivers to end users. CSI provides an extensible southbound interface to a vendor’s storage system, shielding the heterogeneity across the storage device spectrum, delivering a consistent and uniform interface for Kubernetes, and all other container orchestration engines. At the same time, Kubernetes provides a standard northbound interface for Kubernetes users to interact with storage at a higher level of abstraction, in terms of PVs and PVCs, while CSI also allows developers to model disk with respect to disk encryption, snapshots and resizing. So far, all the CSI topics covered relate to Kubernetes, not Anthos. This raises one interesting question. What is the value added, if any, that Anthos brings to the storage space, in addition to what Kubernetes does? One easy,

perhaps even trivial answer, is that Anthos helps in the storage space by simply making the deployment and management of Kubernetes environments easier. By supporting CSI, Kubernetes is in fact expanding the realm of environments in which Anthos can deploy and manage Kubernetes clusters that support stateful containers. In reality, however, Anthos delivers a lot of value today and has the potential to deliver much more in the near future, to support the usage of storage in hybrid and multi-cloud environments and to expand the portability and management of stateful containers. In Anthos, just like in Kubernetes, any pod that requires persistent storage constitutes a stateful pod. Persistent storage stores data beyond the lifetime of a pod and of any enclosure where it may be housed, such as its VM, its node in a cluster, or the cluster itself. In Anthos, the data will only cease to exist if the data itself is deleted by an entity that has the required Identity and Access Management (IAM) rights to do so. Anthos allows stateful pods to be deployed in hybrid and multi-cloud environments, and pods to communicate with data services that may run anywhere across the infrastructure spectrum. As depicted in Figure D.2. Stateful Pods in Anthos, the dashed rectangles delineate different environments, from left to right: GCP cloud (in blue), private cloud on-premises (in red), and any public cloud other than GCP (in brown). In each of the cloud environments, there are stateful Anthos pods running, accessing their own local physical storage for reading or writing data. Figure D.2 Stateful Pods in Anthos
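As an illustration of a stateful workload whose storage outlives its pods, the sketch below (Kubernetes Python client assumed; image, names, and sizes are illustrative placeholders) creates a StatefulSet whose volumeClaimTemplates cause one PersistentVolumeClaim to be provisioned per replica. The claims, and the data on them, survive pod rescheduling and remain until they are explicitly deleted, which is the behavior described above.

```python
# Illustrative sketch: a StatefulSet whose per-replica claims outlive the pods.
from kubernetes import client, config

config.load_kube_config()

stateful_set = client.V1StatefulSet(
    metadata=client.V1ObjectMeta(name="db"),
    spec=client.V1StatefulSetSpec(
        service_name="db",
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "db"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "db"}),
            spec=client.V1PodSpec(containers=[client.V1Container(
                name="db",
                image="postgres:14",
                env=[client.V1EnvVar(name="POSTGRES_PASSWORD", value="example")],
                volume_mounts=[client.V1VolumeMount(
                    name="data", mount_path="/var/lib/postgresql/data")],
            )]),
        ),
        # One PVC per replica, provisioned through the cluster's CSI driver.
        volume_claim_templates=[client.V1PersistentVolumeClaim(
            metadata=client.V1ObjectMeta(name="data"),
            spec=client.V1PersistentVolumeClaimSpec(
                access_modes=["ReadWriteOnce"],
                resources=client.V1ResourceRequirements(
                    requests={"storage": "20Gi"}),
            ),
        )],
    ),
)
client.AppsV1Api().create_namespaced_stateful_set(namespace="default",
                                                  body=stateful_set)
```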

Anthos offers two approaches to expand the spectrum of infrastructure storage that can be utilized, as described in the following sections:
Anthos Managed Storage Drivers
Anthos Ready Storage Partner Program

D.3.1 Anthos Managed Storage Drivers Anthos installs, supports and maintains numerous storage drivers on each platform. These are the differentiating advantages of Anthos Managed Storage Drivers:
Automation of driver selection: where Anthos selects the specific versions of the drivers that should be supported and managed on each node in a cluster. For example, Anthos deploys the GCE Persistent Disk and Filestore drivers on GCP, the AWS EBS and EFS drivers on AWS, the Azure Disk and Azure File drivers on Azure, and the vSphere driver on vSphere. Anthos clusters on bare metal also bundle the sig-storage-local-static-provisioner. Anthos effectively expands the surface of environments where Kubernetes clusters can be deployed and managed in an automated way;
Management of drivers: as the number of drivers increases, Anthos aids in the integration and deployment of these drivers into a customer's Kubernetes environment. As new nodes are created, Anthos ensures that storage is available and accessible by the new nodes, dynamically,

without user intervention. Anthos automates the decision of which driver version to use in specific scenarios. Anthos also has the ability to manage the entire driver lifecycle and provide support for them. The Google Cloud Marketplace contains a set of stateful engines that Anthos customers can use, such as Redis Enterprise.

D.3.2 Anthos Ready Storage Partner Program In addition to Anthos Managed Storage Drivers, users can bring other storage vendors of their choice. Anthos has created the Anthos Ready Storage program to work with selected third party storage partners to develop, test and qualify their CSI drivers with Anthos. This program ensures that the implementation of the CSI driver delivered by these qualifying storage vendor partners provides a seamless experience in Anthos. The latest list of qualified drivers can be found on the Anthos Ready Storage partners site. In order for a storage system to be validated as Anthos Ready Storage, it must go through the Anthos process for the qualification of CSI drivers. Anthos provides a quality assurance process to test drivers and check interoperability between Kubernetes, the drivers and underlying storage. This process ensures that third party CSI drivers are up to par with Anthos standards. By introducing standardization, CSI extended the ecosystem of storage vendors writing plugins and lowered the barrier to entry for developing them. As a result, an avalanche of new plugins and updates started emerging in the market that required testing and quality assurance. The storage vendor’s drivers must meet the following requirements: Dynamic storage provisioning of volumes and other Kubernetes native storage API functions required by customers; Use of Kubernetes for deploying a storage CSI driver and its dependencies; Management of storage for Kubernetes scale up and scale down scenarios; and Portability of stateful workloads that are using persistent storage.

D.3.3 Anthos Backup Services

With well-developed, stable CSI features, users can reliably deploy and run their stateful workloads, like relational databases, in managed Kubernetes clusters in Anthos. Stateful workloads typically have additional requirements over stateless ones, including the need for backup and storage management. Anthos provides a simple-to-use, cloud-native service, namely Backup for GKE, for users to protect, manage and restore their containerized applications and data running in Google Cloud GKE clusters. Two forms of data are captured in a backup:
Cluster state backup or config backup: which consists of a set of Kubernetes resource descriptions extracted from the API server of the cluster undergoing backup; and
Volume backups: which consist of a set of GCE Persistent Disk (PD) volume snapshots that correspond to PersistentVolumeClaim resources found in the config backup.
Backup and restore operations in Backup for GKE are selective. When performing a backup, the user may select which workloads to back up, including the option to select "all workloads". Likewise, when performing a restore, the user may select which workloads to restore, including the option to select "all workloads that were captured in this backup". You can back up workloads from one cluster and restore them into another cluster. Restore operations involve a carefully orchestrated re-creation of Kubernetes resources in the target cluster. Once the resources are created, actual restoration of workload functionality is subject to the normal cluster reconciliation process. For example, Pods get scheduled to Nodes, and then start on those Nodes. During restoration, the user may also optionally apply a set of transformation/substitution rules to change specific parameters in specific resources before creating them in the target cluster. This combination of selective backup and selective restore with transformations is designed to enable and support a number of different backup and restore scenarios including, but not limited to:
Back up all workloads in a cluster and restore them into a separately

prepared Disaster Recovery (DR) cluster;
Back up all workloads, but selectively restore, i.e., roll back, a single workload in the source cluster;
Back up the resources in one namespace and clone them into another one;
Migrate or clone a workload from one cluster to another; and
Change the storage parameters for a particular workload. For example, move the workload from a zonal Persistent Disk (PD) to a regional PD, or change the storage provisioner from the PD in-tree driver to the CSI driver.
It is worth noting at this point that Backup for GKE doesn't back up GKE cluster configuration information, such as node configuration, node pools, initial cluster size and enabled features. Backup for GKE's restore does not involve creating clusters either. It is up to the user to create a target cluster, if one doesn't already exist, and install the Backup for GKE agent into that cluster before any restore operations can commence.

D.3.4 Looking Ahead Looking ahead, Anthos has the potential to make management and usage of storage in Kubernetes, across hybrid and multi-cloud environments, significantly simpler and more automated. Examples include: Easing the deployment of drivers: by creating blueprints and higher level abstractions to facilitate storage vendors in specifying the deployment of their CSI drivers; and Introducing a marketplace for drivers: by offering a one click deployment within Google Cloud, instead of requiring users to jump out of context to reach storage vendor sites to retrieve YAML files to deploy the drivers themselves.

D.4 BigQuery Omni Powered by Anthos BigQuery, a petabyte-scale data warehouse, has become a platform for users to access an ever-increasing portfolio of Google's analytics products, such as AutoML and Tables. The BigQuery team at Google wanted to extend these

same analytics capabilities to data that resides on other public clouds, without requiring customers to migrate or duplicate their data on Google Cloud. The ability to analyze data resident in other clouds, without the need to move the data, defines a new field referred to as multi-cloud analytics. BigQuery Omni extends BigQuery's analysis capabilities to data residing in multiple clouds, whether using standard SQL or using all of BigQuery's interactive user experience to perform ad-hoc queries on the data. At a high level, BigQuery consists of two completely decoupled components: compute and storage. The compute component consists of a very resilient and stateless query engine, known internally at Google as Dremel, capable of executing standard SQL queries in a very efficient manner. The storage component consists of a highly scalable, optimally partitioned, densely compressed, columnar-based data storage engine. These two components interface via a well-defined set of APIs. This decoupling allows both components to scale independently. For use cases that require more data than analytics, the storage scales to accommodate the data size. For analytics-intensive use cases, compute scales accordingly without requiring more storage to be allocated. The BigQuery Omni engineering team was faced with the challenge of operating these components outside of Google's internal "borg" infrastructure. They leveraged the similarities of Kubernetes and "borg", along with Anthos's multi-cloud support, to overcome these challenges. BigQuery Omni uses Anthos GKE as the management software to operate the Dremel query engine in the same region where your data resides in a public cloud. Dremel is deployed, runs and is operated on Anthos GKE in a Google-managed account in the public cloud where the data is stored. With Anthos, Google Cloud has been able to launch BigQuery Omni, which allows Google Cloud to extend its analytics capabilities to data residing on other clouds, such as AWS and Microsoft Azure. Please refer to this link for a demonstration of BigQuery Omni. It is important to note that, while BigQuery Omni leverages the power of Anthos GKE, a user is not obligated to be an Anthos customer. The BigQuery Omni engineering team is the Anthos customer in this case, and BigQuery Omni users do not interact directly with Anthos GKE.

As depicted in Figure D.3. BigQuery Omni Technology Stack, BigQuery Omni introduces the ability to run the BigQuery query engine, Dremel, both in GCP and on other public clouds, such as AWS and Microsoft Azure. The architecture components are distributed as follows. In GCP, the BigQuery technology stack remains unchanged. From bottom to top, it consists of the BigQuery storage that is connected via a Petabit network to the BigQuery compute clusters where Dremel runs. In order to temporarily store and cache intermediate query results, which can be on the order of terabytes for some complex analytics, Dremel implements a distributed in-memory shuffle tier. When running on a public cloud other than GCP, in AWS for example, rather than migrating the data from that cloud to BigQuery's managed storage in Google, BigQuery Omni executes the analytics on the data resident in the cloud's native storage system. In BigQuery Omni, when a user performs a query accessing multi-cloud data, BigQuery Routers use a secure connection to route the query to the Dremel engine running where the data resides. The Dremel workload must be given access to the data residing on the public cloud. Query results can be exported directly back to the data storage, with no cross-cloud movement of the query results or the data used for the query. BigQuery Omni also supports loading files to "regular" BigQuery for users needing to join data resident in BigQuery or train ML models using Vertex AI. Figure D.3 BigQuery Omni Technology Stack

D.4.1 Giving BigQuery Access to Data in AWS In order to enable BigQuery Omni to query data residing on AWS, a user needs to have an AWS account and permission to modify the Identity and Access Management (IAM) policies of the S3 data to grant BigQuery access. There are several steps associated with the process, as outlined in the following sections. Creating an AWS IAM Role and Read Policy The user must first create an AWS IAM role and a read policy for the S3 bucket the user wants to give BigQuery access to, and then attach the policy to the role. Please refer to the Google Cloud BigQuery Omni documentation or the AWS documentation for the latest instructions on how to perform these tasks. Define an External Table

After creating the connection, the user needs to create an External Table definition for their data stored in S3. Please refer to the Google Cloud BigQuery Omni documentation or the AWS documentation for the latest instructions on how to perform these tasks.
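Once the connection and external table exist, queries are issued through the normal BigQuery interfaces. The short sketch below uses the google-cloud-bigquery Python client; the project, dataset, table, and column names are illustrative placeholders rather than values from the text.

```python
# Illustrative sketch: querying an S3-backed external table through BigQuery Omni.
# Project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT customer_id, SUM(order_total) AS total_spend
    FROM `my-gcp-project.aws_dataset.orders_external`
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
"""

# The query is routed to the Dremel engine running next to the data;
# only the (small) result set comes back to the caller.
for row in client.query(query).result():
    print(row["customer_id"], row["total_spend"])
```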

D.4.2 Capitalizing on BigQuery's Storage Design Optimization BigQuery's scalability and high performance derive from a combination of design choices that compound on each other. For example, in order to optimize for query analytics, the BigQuery storage engine stores data in a columnar format, and partitions and clusters the data to reduce the search space of the query. If the remote data is not stored and organized in the same manner, this may lead to some degradation in performance relative to running BigQuery on Google Cloud. However, the benefit of being able to use BigQuery analytics to unlock the value of the data, in its existing format and in its existing location, without incurring any migration effort may be an acceptable trade-off. Users always have the option to optimize the data for BigQuery Omni analytics by emulating some of the technology approaches used by BigQuery, including:
Conversion of the data into a columnar format: enabling the data to be retrieved column by column. In most use cases, analytics look at all the values in columns as opposed to doing a range scan of rows in a table. As a result, columnar formats such as Apache Parquet optimize for retrieval of columns, reducing the amount of data accessed during the analytics and, consequently, increasing performance and reducing costs.
Compression of the data: enabling fewer bytes to be retrieved from storage. For data stored in columnar formats, compression can capitalize on the fact that columnar entries can be very repetitive or follow a pattern, for example, of consecutive numbers. Apache Parquet uses Google Snappy to compress columnar data.
Partitioning of the data: enabling a more efficient data scan when searching for specific data values. BigQuery uses Hive partitioning.
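If you do choose to reorganize the data along these lines, the sketch below (using the pyarrow library; file and column names are placeholders) converts a CSV file into Snappy-compressed Parquet and also writes it as a dataset partitioned by a column, mirroring the columnar-format, compression, and partitioning points above.

```python
# Illustrative sketch: converting row-oriented CSV into compressed,
# partitioned Parquet so that column scans read less data.
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv("orders.csv")          # placeholder input file

# Snappy-compressed, columnar output in a single file...
pq.write_table(table, "orders.parquet", compression="snappy")

# ...or written as a dataset partitioned by a column commonly used in filters.
pq.write_to_dataset(table, root_path="orders_parquet",
                    partition_cols=["order_date"])
```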

D.4.3 Differentiation Behind BigQuery Omni

BigQuery Omni differentiates itself from other distributed query engines, such as Presto, in how it can, over time, leverage Anthos as a management platform to not only deploy, but also monitor, operate, update and administer the entire lifecycle of the query engine. In essence, BigQuery wanted a multi-cloud story for customers. BigQuery needed a Google "borg"-like multi-cloud solution so that Dremel would work on AWS and Microsoft Azure almost exactly like it does on Google Cloud. The BigQuery query engine also runs in a GKE environment, capitalizing on all the modern features of a cloud-native deployment. It is important to note that, perhaps for the first time historically, a cloud vendor, in this case Google, has deployed truly proprietary and differentiated code, Dremel, within the boundaries of a cloud competitor. One can argue that Google had already run GKE on AWS and Microsoft Azure, and this too is proprietary code. But the key differentiation in this case is that GKE is a platform for managing containers, a capability that, today, all cloud providers possess, while Dremel holds the key to analytics at scale, driving significant value to Google users. Ultimately, BigQuery Omni brings to the world of multi-cloud a consistent user experience, with a single pane of glass, for analytics on data residing in multiple clouds, tearing down silos that have previously been cemented around cloud borders. BigQuery Omni leverages Anthos as a secure multi-cloud infrastructure platform, the foundation for providing analytics on data residing in other clouds. And BigQuery does all that without any cross-cloud data movement and without creating any copies of the data.

D.5 Anthos Hybrid AI Google is at the forefront of the AI revolution, and infuses its cutting-edge AI technology in many of its products giving its users capabilities such as autocomplete while typing on any Google product. Google Cloud AI makes available to users many of these industry-leading AI technologies that have been ideated, incubated, implemented and continuously improved upon by Google Cloud, Google and the broader Alphabet family of companies. Google continues to attract and nurture top AI talent. Recently, the Turing

Award, considered the Nobel Prize in computing, has been awarded to two googlers for their contributions in the field. In 2017, David Patterson received the award for helping to develop an approach for faster, lower power microprocessors, now used in 99% of all microprocessors in smartphones, tablets and many Internet of Things (IoT) devices. In 2018, Geoffrey Hinton received the award for laying the foundation of deep neural networks which have spurred major advances in computer vision, speech recognition and natural language understanding. Up until now, however, customers could only benefit from these technological advancements by moving their data to Google Cloud, a requirement that many enterprises could not meet due to several constraints including security, privacy, and regulatory compliance. Furthermore, even if some portion of the data could be moved to cloud, any learnings in terms of software development could not be brought back to the remaining on-premise data because the same tools and APIs were not available on-prem. With Anthos, Google Cloud has engaged in a strategy to enable users to benefit from Google’s advancements in AI without requiring the data to move to Google Cloud. Similar to BigQuery Omni where Anthos has enabled the deployment of Google’s BigQuery query engine, Dremel, on other public clouds, Google Cloud Hybrid AI leverages Anthos to deploy and run Google’s AI services on-prem, close to where the data is.

D.5.1 Hybrid AI Architecture As depicted in Figure D.4. Hybrid AI Technology Stack, Hybrid AI enables the same AI models that have been trained by Google to be deployed and run on-prem, making the same set of AI APIs used in the cloud accessible on-prem as well, where they can reach the local datasets. Figure D.4 Hybrid AI Technology Stack

Hybrid AI starts with the premise that the user has one or more Anthos on-prem clusters already registered with GCP. Using the GCP Marketplace, the user can search for and discover Google Cloud AI solutions, such as a solution for converting speech to text, which will be explained in the following section. The user can select a solution and click-through deploy it on an on-prem Anthos cluster. These AI solutions have been packaged as Anthos Apps that can be deployed on any Anthos cluster, including those running on-prem. These solutions are fully managed by Anthos and benefit from all Anthos capabilities, including auto-scaling, which guarantees that on-prem compute resources are only allocated while in use, and only what is needed. Anthos meters these AI solutions and bills customers based only on usage. Anthos Hybrid AI allows AI applications to be deployed, run and un-deployed

dynamically, on-demand.

D.5.2 Hybrid AI Solutions The first release of Anthos Hybrid AI makes available the following Google Cloud AI services on-prem:
Optical Character Recognition (OCR): which converts documents into text in a digital format; and
Speech-to-Text: which converts voice into text, with support including English, French, Spanish, Cantonese and Japanese. The Google Cloud version already supports more than one hundred different languages and variations, allowing for the easy inclusion of additional languages in the on-prem version, driven by demand.
The following sections discuss each of these use cases in greater detail. Optical Character Recognition Enterprises have a tremendous amount of data being held hostage in non-digitized media, such as paper documents, or hidden inside digital formats that do not allow them to be easily analyzed digitally. For example, a scanned digital image of an invoice or of a government-issued title may function as a record of a transaction and a proof of ownership, respectively, but it may still be very challenging to extract the individual pieces of information that can be analyzed separately. The names of the entities involved in the deal, the date, location, value, and specifics about the items exchanged may remain buried in the digital image. Previous efforts to digitize this information manually, with humans entering the data, have proven to be error-prone, time-consuming, cost-prohibitive and hard to scale. Extracting the information from these documents is a prerequisite to fully analyzing the data. Once the data held by contracts, patents, forms, invoices, ticket stubs, PDFs, images and documents in many other forms can be decomposed into individual digital components, they can be automatically analyzed at scale and may unlock unprecedented insight. This data may allow enterprises, for example, to better understand their customers,

their products and services, their financial operations, their internal processes, their employees' performance and many other aspects of their businesses. Use cases touch many industries, including:
Pharmaceutical: for analyzing drug development data, from laboratory experimentation to clinical trials, where data is captured across a diverse set of data sources;
Legal: for doing discovery of information;
Insurance: for processing claims, which may include analyzing images and processing hand-written forms; and
Asset management: for analyzing warranty and maintenance data.
Optical Character Recognition (OCR) consists of extracting the individual pieces of information residing in documents and converting them to a digital format in an automated manner. In recent years, AI has brought significant advancement in the automation of this process, relying heavily on Convolutional Neural Network (CNN) models that can now be trained on massive amounts of data and bring tremendous accuracy to the process. While public clouds hold the majority of advancements in OCR technologies and offer them in the form of managed services, the content of these documents had to be scanned and moved to the cloud for processing. In many cases, the information residing in these documents is very sensitive and often holds Personally Identifiable Information (PII). As a result, enterprises were reluctant to absorb the risk of transferring any of this data to the cloud. Hybrid AI enables organizations of all sizes to unlock the value of their on-prem data. Hybrid AI does not require moving the data to the cloud and, instead, moves cutting-edge analytics to where the data is. In essence, it leverages cloud AI technologies without forcing the data to be migrated and hosted in the cloud. Speech-to-Text Speech, be it in the form of a pure audio recording or as the audio stream in videos or live conversations, comprises yet another format in which

valuable information remains trapped. A common approach to speech recognition consists of first transforming the speech into text, and then using Natural Language Processing (NLP) to gain insight into the semantics of the speech. Today, enterprises possess speech data that captures human-to-human interactions and human-to-machine interactions. For example, customer calls to contact centers may consist of an initial part completely assisted by machines, with the purpose of gathering data about the issue and more effectively routing the call to an available human agent, and a second part where the customer communicates with a human agent to solve the issue. Similar to OCR, the data to be converted from speech to text may need to remain on-prem. These recordings may not only contain PII and sensitive information, but they may also hold Intellectual Property (IP) on many aspects of the business, such as executive board-level discussions on strategic directions, product launches, technological innovations, and human-resource-related decisions. Another challenge with speech-to-text is the vocabulary used in communications. Enterprises often differ in the context in which they are talking, which tends to be specific to their industry, their product lines, and the code names used internally. As a result, generic models need to be customized to be more contextual and relevant to each business.
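As a rough illustration of what the API surface looks like, the sketch below calls the Speech-to-Text recognize API with the standard google-cloud-speech Python client. The premise of Hybrid AI is that the on-prem deployment exposes an equivalent API, so the api_endpoint override and the audio file shown here are hypothetical placeholders rather than documented values.

```python
# Rough illustration of the Speech-to-Text API shape using the cloud client
# library; the api_endpoint override is a hypothetical placeholder for an
# on-prem Hybrid AI deployment.
from google.cloud import speech

client = speech.SpeechClient(
    client_options={"api_endpoint": "speech.hybrid.example.internal:443"}
)

with open("support-call.wav", "rb") as f:   # placeholder audio file
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```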

D.5.3 Differentiation Behind Hybrid AI Hybrid AI introduces a unique approach to capitalizing on the best-of-breed solutions and technological advancements of Google Cloud AI, where data residency is preserved and the AI code that has been previously designed, developed, trained and tested meets the data where the data is, on-prem, as opposed to forcing the data to move to the cloud. By using AI models developed by Google, enterprises benefit from models that have been optimized to require fewer computing resources to run, that have been designed and trained to deliver more accurate results, and that have been architected to be smaller. By using Anthos, Hybrid AI brings a slew of advantages to how these AI workloads run on-prem, reducing their time to production and enabling the

use of modern practices to manage their entire lifecycle. These AI workloads run on GKE clusters on-prem, managed by Anthos, and, as a result, they benefit from all Anthos capabilities, including:
One-click deployment: where AI applications can be started or terminated with one push of a button, and all underlying resources are dynamically allocated or deallocated accordingly. In the case of AI workloads, this may include the usage of accelerators, such as GPUs, allowing organizations to leverage any pre-existing infrastructure and to increase their utilization;
Auto-scaling: where the clusters running the AI applications automatically scale out and in as models are trained, customized or used for inferencing. This tremendously reduces the operational effort associated with AI workloads;
A/B testing: where the performance of new AI models can be easily compared and contrasted;
Canary deployment: where phased deployment of new models can greatly mitigate any errors introduced in the customization or re-training of models;
Policy management: where AI engineers can manage the deployment and operation of these models through declarative policies;
Life-cycle management: where AI models can benefit from version control and the deployment of models can be upgraded with one click;
Metrics logging and monitoring: where AI models gain all aspects of observability, with all logs sent to a centralized location. The status can be automatically monitored, and the AI models can be managed using the same frameworks and application performance tools used for other applications;
Usage-based metering: where AI models are billed based only on how much they are used;
Service Mesh: where pre-built Istio objects can be leveraged to scale to thousands of connections;
Marketplace: where AI models can be easily searched for and discovered;
Consistent Continuous Integration/Continuous Delivery (CI/CD) processes: where the same workflows can be used on-prem and in the cloud; and

Single Pane of Glass: where users have the same experience regardless of where their AI workloads are running.
Hybrid AI also brings a consistent and cohesive user experience for all AI workloads running in the cloud and on-prem, allowing users to learn a single set of tools to manage the entire lifecycle of AI applications. In addition, having AI workloads running in a GKE environment allows these workloads to automatically capitalize on all the modern features of a cloud-native deployment. Hybrid AI marks only the beginning of a journey where organizations of all sizes can apply AI tools to their data on-prem, not only to customize and deploy models, but also to use notebooks for the development of new models, tools for managing the entire lifecycle of AI applications, and pipelines for managing data ingestion, transformation and storage. Anthos can also be used to enable several types of edge-based processing, such as using vision recognition for image or object classification. In the future, Anthos will also support third-party AI models to be deployed on-prem in the same manner, in which case Anthos may also perform binary authorization and vulnerability scanning of the third-party software.

D.6 Summary
Anthos and GCP provide portability and mobility to developers, persistent storage to accommodate workloads beyond stateless applications, and options for analytics.
Anthos aims at delivering rigor to data workloads, becoming a single layer of control, security, observability and communication between components.
Anthos manages first-party CSI drivers on each platform, and the Anthos Ready Storage program qualifies third-party drivers from industry-leading storage partners.
Anthos isolates stateful applications from the heterogeneity of the underlying hardware and makes stateful containers more portable.
Anthos supports a wide selection of storage systems, both first and third party, meeting users where they are and allowing them to leverage their existing storage systems.

An understanding of portability versus mobility: portability refers to the ability to execute in several locations, while mobility refers to the ability to transfer a resource from one physical place to another.
Kubernetes, and in turn Anthos, supports storage using the Container Storage Interface (CSI).
BigQuery Omni can execute on GKE in GCP, on AWS, or on Microsoft Azure. This allows the BigQuery query engine to access the customer's data residing on other public clouds without requiring any data to be moved.
Anthos Hybrid AI enables the deployment and execution of Google-trained models on-prem, meeting data residency requirements.
Optical Character Recognition (OCR) decomposes documents into a digital format where individual components can be analyzed.
Speech-to-Text converts audio recordings into text, where Natural Language Processing (NLP) can be used to understand the semantics of the speech.

Appendix E An end-to-end example of ML application This chapter covers
Why do ML in the cloud?
When to do ML in the cloud?
How to build an ML pipeline on Anthos using Kubeflow
Understanding TensorFlow Extended
Learning the features of Vertex AI
In the preceding sections, you were introduced to Anthos and how to migrate your existing applications to the Anthos platform. This chapter will demonstrate how to run end-to-end machine learning workloads on multiple cloud providers and on-prem. A fully working and production-ready project will be discussed in depth. We will be using Kubeflow on the Anthos platform. Specifically, the chapter will introduce you to the need for automation in the ML pipeline, the concept of MLOps, TensorFlow Extended, and Kubeflow. We will learn how Kubeflow can be used on-prem and in the cloud to automate the ML pipeline, with a specific example of handwritten digit recognition. Finally, we will explore Vertex AI, a one-stop shop for complete MLOps.

E.1 The need for MLOps Cloud computing has democratized the machine learning world. Computational resources like GPUs and TPUs are no longer limited to big institutes or organizations; the cloud has made them accessible to the masses. Google Maps on your mobile and Google Translate on the go both use machine learning algorithms running in the cloud. Whether you are a big tech company or a small business, shifting your ML tasks to the cloud allows you to leverage the elasticity and scalability offered by the

cloud. Your system resources no longer constrain you; furthermore, you benefit from the proprietary tools provided by the cloud service providers. As an AI/ML scientist or engineer, I can hear you saying that the cloud is all well and good, but shifting to the cloud is cumbersome. Your concerns are not unfounded; many of us have struggled to deploy our AI/ML models to the web. The journey from AI research to production is long and full of hurdles. The complete AI/ML workload, from model building to model deployment to allocating web resources, is cumbersome, as any change in one step leads to changes in another. As shown in Figure E.1, only a small fraction of a real-world ML system is concerned with learning and prediction; however, it requires the support of a vast and complex infrastructure. The problem is aggravated by the fact that changing anything changes everything (CACE): a minor tweak in hyperparameters, a change to learning settings, or a modification of data selection methods, and the whole system needs to change. Figure E.1 Different Components of an ML system[209]

In the IT sector, speed, reliability, and access to information are critical

components for success. No matter which sector your company is working in, it requires IT agility. This becomes even more important when we talk about AI/ML-based solutions and products. Today most industries perform the ML task manually, resulting in enormous time gaps between building an ML model and its deployment (Figure E.2). The data collected is prepared and processed (normalization, feature engineering, etc.) so that it can be fed to the model. The model is trained and then evaluated over various metrics and techniques; once the model satisfies the requirements, it is sent to the model registry, where it is containerized for serving. Each step from data analysis to model serving is performed manually, and the transition from one step to another is also manual. The data scientist works separately from the Ops team; they hand over a trained model to the development team, who then deploy the model in their API infrastructure. This can result in training-serving skew[210] - the difference between the model performance during training and performance during serving. Figure E.2 Machine Learning Workflow[211]

Further, since model development is separate from final deployment, release iterations are infrequent. The greatest setback, however, is the lack of active performance monitoring. The prediction service does not track or maintain a log of the model predictions necessary to detect any model performance degradation or drift in its behavior. Theoretically, this manual process might be sufficient if the model is rarely changed or retrained. In practice, however, models often fail when they are deployed in the real world[212]. The reasons for failure are manifold:
Models get outdated: With time, the accuracy of the model drops. In the classical ML pipeline, there is no continuous monitoring to detect the fall in model performance and rectify it; the end user, however, bears the pain. Imagine you are providing services to a fashion house, suggesting new apparel designs based on customers' past purchases and fashion trends. Fashion changes dramatically with time; the colors that were 'in' in autumn no longer work in winter. If your model is not ingesting recent fashion data and using it to give customers recommendations, the customers will complain, traffic to the site will drop, and after some delay the business team will notice; on identifying the problem, you will be asked to update the model with the latest data. This situation can be avoided if there is continuous monitoring of model performance and there are systems in place to implement continuous training (figure E.3) on newly acquired data. Figure E.3 Continuous Training

Data drift: The difference between the joint distribution of input features and output in the training dataset and test dataset can cause dataset drift [2]. When the model was deployed, the real-world data had the same distribution as the training dataset, but with time the distribution changed. Suppose you build a model to detect network intrusion based on the data available at that time. Six months have passed; do you think it will work as efficiently as it did at the time of deployment? It may, but chances are it is fast drifting away - in the internet world, six months is almost six generations! The problem can be resolved if there are options to get metrics sliced on recent data.
Feedback loops: There may exist unintentional feedback, where the predictions made by the model end up affecting its own training data. For example, consider that you are working for a music streaming company,

the company uses a recommendation system that recommends new music albums to users based on their past listening history and profile. The system recommends albums with, let us say, more than a 70% confidence level. The company decides to add a feature for users to like or dislike music albums. Initially, you will be jubilant, as the recommended albums get more and more likes, but as time goes by, the listening history will affect the model's predictions, and unknowingly the system will recommend more and more music similar to what users have heard before, leaving out new music the users might have enjoyed. To mitigate this problem, continuous monitoring of the system metrics is helpful.
To learn more about the technical debt incurred by machine learning models, I suggest readers go through the paper titled "Machine learning: The high interest credit card of technical debt" by Sculley et al., which discusses in detail the technical debt in machine learning and the maintenance cost associated with systems using AI/ML solutions. Though it is impossible, and even unnecessary, to obliterate technical debt entirely, a holistic approach can reduce it. What is needed is a system that allows one to integrate standard DevOps pipelines with our machine learning workflows - ML pipeline automation: MLOps. Let us see how Anthos can facilitate MLOps.
Run AI/ML applications across hybrid and multi-cloud environments: Anthos, a managed application platform, allows you to conveniently and efficiently manage your entire AI/ML product lifecycle by managing the on-prem and on-cloud infrastructure and the security of data and models. Traditionally, an AI engineer develops machine learning code in different environments, with different clusters, dependencies, and even infrastructure needs (for example, training is compute-intensive and typically requires GPUs). Once the model is fully trained, the development team takes it to the next stage; the infrastructure at the deployment and production stage is very different (deployment can take place on CPUs). The infrastructure abstraction offered by Anthos provides much-needed portability; it allows one to build and run AI/ML applications efficiently and securely. Anthos's truly hybrid cloud architecture lets you build and deploy your code anywhere without making any changes. With the Anthos hybrid architecture, you can

develop and run some code blocks on-prem and others in the cloud. Anthos gives you the flexibility to build, test, and run your AI/ML applications across hybrid and multi-cloud environments.
Use Anthos GKE to manage CPU/GPU clusters: Another advantage of using Anthos for your AI/ML workflow is the GPU support it provides. In collaboration with NVIDIA[213], the world's number one GPU manufacturer, Google's Anthos uses the NVIDIA GPU Operator to deploy the GPU drivers required to enable GPUs on Kubernetes. This provides users with a broad choice of GPUs, such as the V100, T4, and P4. With Anthos, you can thus manage your existing GPUs on-prem and support any future GPU investment you make. It is also possible to shift your workloads into the cluster if you require more compute resources. Thus, using Anthos GKE you can easily manage GPU/CPU clusters both in-house and in the cloud.
Secure data and models with ASM: Security of both data and models is paramount. Anthos Service Mesh (ASM) allows you to configure security access for your entire working environment. The chapter on Anthos Service Mesh covers in detail how it can be used to provide a resilient, scalable, secure and manageable service.
Deploy AI/ML using Cloud Run: Lastly, one can directly deploy the trained, dockerized model on Cloud Run, Google's serverless container-as-a-service platform.
In the coming sections, we will see how, with Anthos and GCP tools like Cloud Run and TensorFlow Extended, and orchestration tools like Kubeflow and Vertex AI, we can solve core MLOps issues like portability, reproducibility, composability, agility, and versioning, and build production-ready AI/ML solutions. Let us first start with understanding what exactly we mean by full ML pipeline automation.

E.2 ML pipeline Automation In the previous section (Figure E.2), we elaborated on the steps involved in delivering an ML project from inception to production. Each of these steps can be completed manually or via an automatic pipeline. In this section, we will see how each of these steps can be automated. The level of automation of these steps decides the time gap between training new models and their

deployment and can help fix the challenges we discussed in the previous section. The automated ML pipeline should be able to:
Allow different teams involved in product development to work independently. Ideally, many teams are involved in an AI/ML workflow, from data collection and data ingestion to model development and model deployment. As discussed in the introduction section, any change by one of the teams affects all the rest (CACE). An ideal ML pipeline automation should allow the teams to work independently on various components without any interference from others.
Actively monitor the model in production. Building the model is not the real challenge; the real challenge resides in maintaining the model's accuracy in production. This is possible if the model in production is actively monitored: logs are maintained and triggers are generated if model performance goes below a threshold. This will allow you to detect any degradation in performance. It can be done by performing an online model validation step.
Accommodate data drift: the pipeline should evolve with new data patterns that emerge as new data comes in. This can be accomplished by adding an automated data validation step in the production pipeline (a minimal sketch of such a check appears after figure E.4). Any skew in the data schema (missing features or unexpected values for the features) should trigger the data science team to investigate, while any substantial change in the statistical properties of the data should set a trigger for retraining the model.
In the field of AI/ML, new model architectures come out every week, and you may be interested in experimenting with the latest model or tweaking your hyperparameters. The automated pipeline should allow for continuous training (CT). CT also becomes necessary when the production model falls below its performance threshold or a substantial data drift is observed.
Additionally, reproducibility is a big problem in AI, so much so that NeurIPS, the premier AI conference, has established a reproducibility chair[214]. The aim is for researchers to submit a reproducibility checklist to enable others to reproduce the results. Using modularized components not only allows teams to work independently but also lets them make changes without impacting other teams. It allows the teams to narrow down issues to a given component and thus helps with reproducibility.

And finally, for expeditious and dependable updates at the production level, there should be a robust CI/CD system. Delivering AI/ML solutions rapidly, reliably, and securely can help enhance your organization's performance. Before you serve your model to live traffic, you may also want to do A/B testing; you can do so by configuring the pipeline so that the new model serves 10-20% of the live traffic. If the new model performs better than the old model, you can serve all the traffic to it; otherwise, roll back to the old model. In essence, we need MLOps - machine learning (ML) with DevOps (Ops) - an integrated engineering solution that unifies ML system development and ML system operation. This allows data scientists to explore various model architectures, experiment with feature engineering techniques and hyperparameters, and push the changes automatically to the deployment stage. Figure E.4 below shows the different stages of the ML CI/CD automation pipeline. We can see that the complete automation pipeline contains six stages:
1. Development/Experimentation: In this stage, the data scientist iteratively tries various ML algorithms and architectures. Once satisfied, s/he pushes the source code of the model to the source code repository.
2. Pipeline continuous integration: This stage involves building the source code and identifying and outputting the packages, executables, and artifacts that need to be deployed in a later stage.
3. Pipeline continuous delivery: The artifacts produced in stage 2 are deployed to the target environment.
4. Continuous training: Depending upon the triggers set, a trained model is pushed to the model registry at this stage.
5. Model continuous delivery: At this stage, we get a deployed model prediction service.
6. Monitoring: In this stage, model performance statistics are collected and used to set triggers to execute the pipeline or start a new experiment cycle.
Figure E.4 Stages of the automated ML pipeline with CI/CD
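As a minimal illustration of the data-validation and monitoring points mentioned in the list above, the sketch below compares the distribution of one feature in recent serving data against the training data with a two-sample Kolmogorov-Smirnov test and flags possible drift. The synthetic data, the threshold, and the alerting action are assumptions made for the example; a real monitoring stage would read the feature values from logs or a feature store.

```python
# Minimal sketch of a data-drift check that a monitoring stage might run:
# compare a feature's training distribution against recent serving data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)   # stand-in for training data
serving_feature = rng.normal(loc=0.4, scale=1.0, size=2_000)  # stand-in for recent traffic

statistic, p_value = ks_2samp(train_feature, serving_feature)

P_VALUE_THRESHOLD = 0.01   # assumption: tune per feature and traffic volume
if p_value < P_VALUE_THRESHOLD:
    # In a real pipeline this would raise an alert or trigger retraining.
    print(f"Possible drift detected (KS={statistic:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```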

In the coming sections, we will cover some GCP tools that you can use to implement MLOps. We will talk about Cloud Run, TensorFlow Extended, and Kubeflow. The focus of the chapter will be Kubeflow and Vertex AI. Before we delve into the chapter, we should revisit a few important concepts, starting with Cloud Run, which has been covered in another chapter. As you already know, Cloud Run is Google's serverless container-as-a-service platform. Cloud Run allows one to run an entire application in a container and can be used to deploy any stateless HTTP container. You just need to specify a Dockerfile with all the dependencies and the ML prediction code that you want to run, package them up as a container, and boom, the service is deployed on the cloud. Recently[215] Google extended Cloud Run capabilities to include end-to-end HTTP/2 connections, WebSockets compatibility, and bidirectional gRPC streaming. Thus, you can now deploy and run a wide variety of web services using Cloud Run. While Cloud Run is scalable, resilient, and

offers straightforward deployment of AI/ML apps, it has some constraints. For example, the maximum number of vCPUs[216] that can be requested is limited to 4 (the option to increase it up to 8 vCPUs was available in preview at the time of writing this book). By integrating Cloud Run with Cloud Build, you can automate the whole process and quickly implement CI/CD for your AI/ML workflow. The concepts related to Cloud Build, the GCP-native CI/CD platform, are covered in another chapter. Cloud Build works on container technology. Details about the container registry and building container images are covered in another chapter.
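To make the "stateless HTTP container" idea concrete, here is a minimal, hypothetical prediction service that could be containerized and deployed to Cloud Run; the model file, feature shape, and route are placeholders, and a real service would add input validation and error handling. The only Cloud Run-specific detail is reading the port from the PORT environment variable.

```python
# Minimal sketch of a stateless prediction service suitable for Cloud Run.
# The model artifact and feature names are placeholders.
import os
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:      # model baked into the container image
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]     # e.g. [[5.1, 3.5, 1.4, 0.2]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```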

E.3 TensorFlow Extended TensorFlow Extended (TFX) is a scalable end-to-end platform for the development and deployment of AI/ML workflows in TensorFlow. TFX includes libraries for data validation, data preprocessing, feature engineering, building and training AI/ML models, evaluating model performance, and finally serving models as REST and gRPC APIs. You can judge the value of TFX by knowing that many Google products[217], like Chrome, Google Search, Gmail, etc., are powered by TFX. Google uses TFX extensively, and so do Airbnb, PayPal, and Twitter. TFX as a platform uses various libraries to build an end-to-end ML workflow. Let us see these libraries and what they can do:
TensorFlow Data Validation (TFDV): This library has modules that allow you to explore and validate your data. It allows you to visualize the data on which the model was trained and/or tested. The statistical summary it provides can be used to detect any anomaly present in the data. It has an automatic schema generation feature, which gives you a description of the expected range of the data. Additionally, when comparing different experiments and runs, you can also use it to identify data drift.
TensorFlow Transform (TFT): With the help of TFT, you can preprocess your data at scale. The functions provided by the TFT library can be used to analyze data, transform data, and perform advanced feature engineering tasks. The advantage of using TFT is that the preprocessing step is modularized. A hybrid of Apache Beam and

Tensorflow allows you to process the entire dataset, like getting maximum and minimum values or all possible categories, and manipulate the data batch as Tensors. It uses the Google DataFlow cloud service. TensorFlow Estimator and Keras: This is the standard TensorFlow framework you can use to build your model and train them. It also provides you access to a good range of pre-trained models. TensorFlow Model Analysis (TFMA): It allows you to evaluate your trained model on large amounts of data in a distributed manner on the same model evaluation metrics that you defined while training. It helps analyze and understand the trained models. TensorFlow Serving (TFServing): Finally, if you are satisfied with your trained model, you can serve your model as REST and gRPC APIs for online production. The figure below shows how the different libraries are integrated to form a TFX based AI/ML pipeline. Figure E.5 TFX based AI/ML pipeline[218]
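As promised above, here is a minimal, hedged sketch of using TFDV to compute statistics, infer a schema, and validate a new data split; the Cloud Storage paths are illustrative assumptions.

import tensorflow_data_validation as tfdv

# Compute summary statistics over the training data.
train_stats = tfdv.generate_statistics_from_csv(data_location='gs://my-bucket/train.csv')

# Infer a schema (expected types, ranges, and categories) from those statistics.
schema = tfdv.infer_schema(statistics=train_stats)

# Validate a new batch of data against the inferred schema to surface anomalies
# such as missing values, unexpected categories, or drifted distributions.
eval_stats = tfdv.generate_statistics_from_csv(data_location='gs://my-bucket/eval.csv')
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)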

It is possible to run each of the above steps manually; however, as we discussed in the preceding section, for MLOps we would like the steps to run automatically. To do this we need an orchestration tool, a tool which connects these various blocks (components) of the ML workflow together. This is where Kubeflow comes into the picture, which will be the topic of the next section.

E.4 Kubeflow: an introduction

Kubeflow allows you to manage the entire AI/ML lifecycle. It is a Kubernetes-native open-source (OSS) platform to develop, deploy, and manage scalable, end-to-end machine learning workloads on hybrid and multi-cloud environments. Kubeflow Pipelines, a Kubeflow service, helps you automate the entire AI/ML lifecycle - in other words, it lets you compose, orchestrate, and automate your AI/ML workloads. It is an open-source project, and as you can see from the image of commits below, it is an active and growing project. One of the primary goals with which Kubeflow is built is to make it easy for everyone to develop, deploy, and manage portable, scalable machine learning.

Figure E.6 Commits on the Kubeflow project as shown in its GitHub (https://github.com/kubeflow/kubeflow/graphs/contributors) repo on 4th Feb 2021.

The best part is that even if you do not know much about Kubernetes, you can use the Kubeflow API to build your AI/ML workflow. It is possible to use Kubeflow on your local machine and on any cloud (GCP, Azure, AWS); you can choose a single node or a cluster, and it is built to run consistently across various environments. In November 2020, Google released Kubeflow 1.2, which allows organizations to run their ML workflows on Anthos across environments. Kubeflow is built around three key principles:

Composability: Kubeflow extends Kubernetes' ability to run independent and configurable steps using machine-learning-specific frameworks (like TensorFlow, PyTorch, etc.) and libraries (Scikit-Learn, Pandas, etc.). This allows you to use different libraries for the different tasks involved in the AI/ML workflow; for instance, you may require one version of TensorFlow for the data processing steps and a different version during training. Each task in the AI/ML workflow can thus be independently containerized and worked upon.

Portability: You can run all the pieces of your AI/ML workflow anywhere you want - on cloud, on-prem, or on your laptop while on vacation - the only condition being that they are all running Kubeflow. Kubeflow creates an abstraction layer between your AI/ML project and your system, thus making it possible to run the ML project anywhere Kubeflow is installed.

Scalability: You can have more resources when you want and release them when not needed. Kubeflow extends Kubernetes' ability to maximize available resources and scale them with as little manual effort as possible.

In this section, we will learn to use Kubeflow on the cloud-native ecosystem provided by Anthos running on Kubernetes. Some of the advantages of using Kubeflow:

Standardize on a common infrastructure.
Leverage open-source cloud-native ecosystems for the entire AI/ML lifecycle: developing, orchestrating, deploying, and running scalable and portable AI/ML workloads.
Run AI/ML workflows in hybrid and multi-cloud environments.
Additionally, when running on GKE, take advantage of GKE's enterprise-grade security, logging, autoscaling, and identity features.

Kubeflow adds CRDs (custom resource definitions) to the clusters.

Kubeflow leverages containers and Kubernetes and thus can be used anywhere Kubernetes is already running, especially on premises with Anthos with GKE. Below we list various Kubeflow applications and components that can be used to arrange your ML workflow on top of Kubernetes:

Jupyter Notebooks: For AI/ML practitioners, Jupyter notebooks[219] are the de facto tool for rapid data analysis. Most data science projects start with a Jupyter notebook; it is the starting point of the modern cloud-native machine learning pipeline. The Kubeflow notebooks allow you to run your experiments locally, or if you want, you can take the data, train the model, and even serve it - all through the notebook. Notebooks integrate well with the rest of the infrastructure, for things like accessing other services in the Kubeflow cluster using the cluster IP addresses, and they integrate with access control and authentication. Kubeflow allows one to set up multiple notebook servers, with the possibility to run multiple notebooks per server. Each notebook server belongs to a single namespace, depending upon the project or team for that server. Kubeflow provides multi-user support using namespaces, which makes it easier to collaborate and manage access. Using a notebook on Kubeflow allows you to dynamically scale resources. And the best part: it comes with all the plugins/dependencies you might need to train a model in Jupyter, including TensorBoard visualizations and the customized compute resources that you might need to train the model. The Kubeflow notebooks provide the same experience as Jupyter Notebooks locally, with the added benefits of scalability, access control, collaboration, and submitting jobs directly to the Kubernetes cluster.

Kubeflow UI: A user interface that is used to run pipelines, create and start experiments, explore the graph, configuration, and output of your pipeline, and even schedule runs.

Katib: Hyperparameter tuning is a pivotal step in the AI/ML workflow, and finding the right hyperparameter space can take a lot of effort. Katib supports hyperparameter tuning, early stopping, and neural network architecture search. It helps one find the optimum configuration for production around the metrics of choice.

Kubeflow Pipelines: Kubeflow Pipelines lets you build a set of steps to do everything, from collecting data to serving the trained model. It is built upon containers, so each step is portable and scalable. You can use Kubeflow Pipelines to orchestrate end-to-end ML workflows.

Metadata: It helps in tracking and managing the metadata that AI/ML workflows produce. This metadata logging can be used to evaluate models in real time. It can help in identifying data drift or training-serving skew. It can also be used for audit and compliance - you can know which models are in production and how they are behaving. The Metadata component is installed with Kubeflow by default. Many Kubeflow components write to the metadata server; additionally, you can write to the metadata server from your own code. You can use the Kubeflow UI to see the metadata through the artifact store.

KFServing: It allows one to serve AI/ML models built on arbitrary frameworks. It includes features like autoscaling, networking, and canary rollouts, and it provides an easy-to-use interface for serving models in production. Using a YAML file you can provision the resources for serving and computing. The canary rollout allows you to test and update your models without impacting the user experience.

Fairing: A Python package which allows you to build, train, and deploy your AI/ML models in hybrid cloud environments.

To summarize, Kubeflow provides a curated set of compatible tools and artifacts that lie at the heart of running production-enabled AI/ML apps. It allows businesses to standardize on a common modeling infrastructure across the entire machine learning lifecycle. Let us take a deep dive into the core set of applications and components included in Kubeflow next.

E.4.1 Kubeflow deep dive

By now you know how to deploy your Anthos environment with clusters, applications, Anthos Service Mesh, and Anthos Config Management. You have the project selected and the Service Management APIs enabled. Also, verify that the Istio ingress gateway service is enabled for traffic; this is used by Anthos Service Mesh to add more complex traffic routing to inbound traffic. To continue the deep dive, you will need to install Kubeflow[220]. Kubeflow uses Kustomize[221] to manage deployments across different environments, so the first task to be able to use Kubeflow in your cluster is to install Kustomize[222]. You can verify all the Kubeflow resources deployed on the cluster using the command:

kubectl get all

Below you can see an excerpt of the output of the kubectl get all command:

Figure E.7 Output of kubectl get all command

Let us now look more deeply into the different components and features of Kubeflow. You can also use Anthos Config Management to install and manage Kubeflow[223]. The easiest way is to try solutions from the GCP Marketplace like MiniKF or Kubeflow Pipelines. The AI Platform on GCP offers an easy graphical interface to create a cluster and install Kubeflow Pipelines for your ML/AI workflow; with just three clicks you can have a Kubeflow cluster ready for use.

E.4.2 Kubeflow central dashboard

Just like other GCP services, Kubeflow has a central dashboard that provides a quick overview of the components installed in your cluster. The dashboard provides a graphical user interface that you can use to run pipelines, create and start experiments, explore the graph, configuration, and output of your pipeline, and even schedule runs. The Kubeflow central dashboard is accessible through a URL with the pattern (the deployment name and project ID placeholders depend on your setup): https://<deployment-name>.endpoints.<project-id>.cloud.goog/

It is also possible to access the UI using the Kubernetes command-line tool kubectl. You will first need to set up port forwarding[224] to the Istio ingress gateway using a command along the following lines (the namespace depends on your installation; istio-system is the usual default):

kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

and then access the central dashboard using: http://localhost:8080/

Remember that you should have the IAP-secured web app[225] user role to be able to access the Kubeflow UI. Also, you should have the Istio ingress configured to accept HTTPS traffic. The central dashboard includes the following:

Home: The central page that can be used for navigation between the various Kubeflow components.
Pipelines
Notebook Servers
Katib
Artifact Store
Manage Contributors

Figure E.8 Kubeflow central dashboard using MiniKF deployment

We will explore each of these components next.

E.4.3 Kubeflow Pipelines

Kubeflow Pipelines, a service of Kubeflow, allows you to orchestrate your AI/ML workloads. It can be installed with Kubeflow or as a standalone service, and the GCP Marketplace offers an easy installation of Kubeflow Pipelines with a single click. As we have discussed in the preceding sections, building an AI/ML solution is an iterative process, and hence it is important to track changes in an orderly, organized manner - keeping track of changes, monitoring, and versioning can be challenging. Kubeflow Pipelines eases the process by providing you with easily composable, shareable, and reproducible AI/ML workflows.

Kubeflow Pipelines allows one to completely automate the process of training and tuning the model. To do this, Kubeflow Pipelines makes use of the fact that machine learning processes can be broken down into a sequence of standard steps, and these steps can be arranged in the form of a directed graph (Figure E.9). While the process appears straightforward, what complicates the matter is that it is an iterative process: the AI scientist needs to experiment with multiple types of pre-processing, feature extraction, types of model, etc. Each such experiment results in a different trained model, and these different trained models are compared in terms of their performance on the chosen metrics. The best model is then saved and deployed to production. To be able to continuously perform this operation, DevOps requires that the infrastructure is flexible enough that multiple experiments can coexist in the same environment. Kubeflow enables this via Kubeflow Pipelines. Each of the boxes in Figure E.9 is self-contained code conceptualized as a Docker container. Since containers are portable, each of these tasks inherits the same portability.

Figure E.9 The machine learning process as a directed graph

The containerization of the tasks provides portability, repeatability, and encapsulation. Each of these containerized tasks can invoke other GCP services like Dataflow, Dataproc, etc. Each task in the Kubeflow pipeline is self-contained code packaged as a Docker image, with its inputs (arguments) and outputs. Because the tasks are self-contained, you can run them anywhere; moreover, you can reuse the same task in another AI/ML pipeline.

Figure E.10 Sample Graph from Kubeflow Pipeline

The Kubeflow Pipelines platform consists of five main elements:

A user interface for creating, managing, and tracking experiments, jobs, and runs.
An orchestration engine that uses Kubernetes resources in the background and makes use of Argo[226] to orchestrate portable and scalable ML jobs on Kubernetes.
A Python SDK, which is used to define and manipulate pipelines and components.
Jupyter Notebooks, which you can use to interact with the system using the Python SDK.
ML Metadata, which stores information about different executions, models, datasets used, and other artifacts - in essence, metadata logging. This metadata logging allows you to visualize the metrics output and compare different runs.

The Python SDK allows one to describe the pipeline in code, and one can also use the Kubeflow UI to visualize the pipeline and view the different tasks. Configuring the ML pipeline as containerized tasks arranged as a directed acyclic graph (DAG) enables one to run multiple experiments in parallel. Additionally, Kubeflow allows one to reuse pre-built code, which saves a lot of time as there is no need to reinvent the wheel. GCP also offers AI Hub, which has a variety of plug-and-play reusable pipeline components. Figure E.10 shows a sample graph of a Kubeflow pipeline.

Kubeflow uses a domain-specific language (DSL) to describe your pipeline and components (Figure E.11). A Kubeflow pipeline can be specified using the kfp.dsl.pipeline decorator, which contains metadata fields where you can specify its name and purpose. The arguments to the pipeline function describe what it will take as inputs, and the body of the function describes the actual Kubeflow ops to be executed. These ops are the Docker containers that are executed when the task is run.

Figure E.11 Basic structure of Kubeflow Component

Kubeflow Pipelines allows three types of components:

Pre-built components: These are prebuilt components available in the GitHub repo: https://github.com/kubeflow/pipelines/tree/master/components. There is a wide range of components available here for different platforms, from preprocessing and training to deploying a machine learning model. To use them you just require the URI of the component.yaml file, which is the description of the component; it contains the URI of the container image and the component run parameters. These arguments are passed to the corresponding Kubeflow ops in the pipeline code. One pre-built component can represent operations like training, tuning, or deploying a model.

Lightweight Python components: If you have small Python functions, it doesn't make sense to write full Dockerfiles for each. The Kubeflow SDK allows one to wrap these lightweight functions as Kubeflow components with the help of the func_to_container_op helper function, defined in kfp.components. We pass the function and a base image as inputs to func_to_container_op.

Custom components: In case we have functions written in other languages, for example Go, one can use custom-built components. In this case, you need to write the code that describes the behaviour of the component, the code that creates the container, and all the dependencies of the code.

Let us take a simple example to demonstrate how Kubeflow components can be created. As the first step, you will need to define your component code as a standalone Python function; for example, we define a function to multiply two numbers:

def mul_num(a: float, b: float) -> float:
    return a*b

Next, generate the component specification YAML using kfp.components.create_component_from_func (the output file name here is illustrative):

mul_op = kfp.components.create_component_from_func(mul_num, output_component_file='mul_num.yaml')

The YAML file is reusable, that is, you can share it with others, or reuse it in another AI/ML pipeline. And now you can create your pipeline:

@kfp.dsl.pipeline(
    name='multiply numbers',
    description='An example to show how to build Kubeflow pipelines')
def mul_pipeline(a=0.4, b=0.2):
    first_task = mul_op(a, 2)
    second_task = mul_op(first_task.output, b)

Finally, you can create a pipeline run using create_run_from_pipeline_func().
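A minimal sketch of that final step might look as follows. The Kubeflow Pipelines host URL is an illustrative assumption (inside a Kubeflow notebook, kfp.Client() can usually be created without arguments), and mul_pipeline is the pipeline defined above.

import kfp

client = kfp.Client(host='http://localhost:8080')  # e.g., via the port-forward shown earlier
run = client.create_run_from_pipeline_func(
    mul_pipeline,
    arguments={'a': 0.4, 'b': 0.2})
print(run.run_id)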

E.4.4 Hyperparameter tuning using Katib

Finding the right hyperparameters is very important in the AI/ML workflow. AI scientists spend hours and sometimes days finding the right hyperparameters. The work involves frequent experimentation and is often challenging and time-consuming. We can automate the process of hyperparameter tuning using one of the pre-built components described in the previous section; alternatively, we can use Katib on Kubeflow to do the same. Katib is a scalable Kubernetes-native AutoML platform; it allows both hyperparameter tuning and neural network architecture search. Figure E.12 shows the design of Katib. To learn more about how it works, readers should refer to the paper Katib: A distributed general automl platform on Kubernetes.

Figure E.12 Design of Katib as general AutoML system[227]

Katib allows you to define a hyperparameter tuning experiment either from the command line via a YAML specification or from a Jupyter Notebook using the Python SDK. It also has a graphical user interface, which you can use to specify the hyperparameters to be tuned and to visualize the results.

Figure E.13 shows the graphical interface of Katib. Figure E.13 Katib graphical interface

Katib allows you to choose the metric and whether you want to minimize or maximize it. You can specify the hyperparameters you want to tune, and it allows you to visualize the results of the entire experiment as well as the results of the individual runs. Figure E.14 shows the result from a Katib run with validation accuracy as the metric and the learning rate, the number of layers, and the optimizer as the hyperparameters to be tuned.

Figure E.14 Results of hyperparameter tuning generated by Katib

E.5 End-to-end ML on Kubeflow

Now that we have learned about Kubeflow, it is time to put it into practice. We will be building a complete pipeline, from data ingestion to serving, using Kubeflow. Since the aim of this chapter is to talk about Kubeflow, not the AI/ML models, we will work with the MNIST example and will train a basic model to classify the handwritten numerals of MNIST; the trained model will be deployed behind a web interface. The complete code is available in the repo: https://github.com/EnggSols/KubeFlow. To keep it simple and cost-effective we will be using CPU-only training, and we will use Kubeflow from the command line. Ensure all the environment variables are properly set and the GKE API is enabled. The trained model will be stored in a storage bucket. If you do not have one already, create it using gsutil:

gsutil mb gs://${BUCKET_NAME}/

Here BUCKET_NAME is a name that is unique across all of Google Cloud Storage. We use the model.py file to train; it is fairly straightforward code. With very little variation for the Kubeflow platform, the program uploads the trained model to the specified path after training. To perform the training on Kubeflow, we first need to build a container image; below is the Dockerfile we will use from the GitHub repo:

FROM tensorflow/tensorflow:1.15.2-py3
ADD model.py /opt/model.py
RUN chmod +x /opt/model.py
ENTRYPOINT ["/usr/local/bin/python"]
CMD ["/opt/model.py"]

We use the docker build command to build the container image. Once it is built, push the image to the Google Container Registry so that you can run it on your cluster. Before actually pushing the image, you can check locally whether the model is indeed running, using docker run:

docker run -it $IMAGE_PATH

You should see training logs like those shown in Figure E.15, which implies that training is working.

Figure E.15 Training logs

If you see this log, it means you can safely push the image to the container registry. Now that the image is pushed, we build a Kustomize YAML manifest by setting the required training parameters:

kustomize build .

Figure E.16 shows the screenshot of the output of the Kustomize build command. Figure E.16 Kustomize build output

And finally, we pipe this YAML manifest to kubectl, which deploys the training job to the cluster:

kustomize build . | kubectl apply -f -

The training might take a few minutes. While the training is going on, you can check the bucket to verify that the trained model is uploaded there.

Figure E.17 Cloud Storage bucket; you can see the saved models listed here

As before, create the Kustomize YAML manifest for serving the model, and deploy the model to the server. Now the only step left is to run the web UI, which you can do by using the web front-end manifest. Establish port forwarding so that you can directly access the cluster and see the web UI:

Figure E.18 Deployed Model

You can test it with random images; the end-to-end MNIST classifier is deployed.

E.6 Vertex AI

Kubeflow allows you to orchestrate the MLOps workflow, but one still needs to manage the Kubernetes cluster. An even better solution is one where we need not worry about the management of clusters at all: enter Vertex AI Pipelines. Vertex AI provides tools for every step of the machine learning workflow: from managing datasets to different ways of training the model, evaluating, deploying, and making predictions. Vertex AI, in short, is a one-stop shop for AI needs. You can use Kubeflow pipelines or TensorFlow Extended pipelines in Vertex AI.

Whether you are a beginner with no coding experience but a great idea for using AI, or a seasoned AI engineer, Vertex AI has something to offer you. As a beginner, you can use the AutoML feature offered by Vertex AI: you just load your data, use the data exploration tools provided by Vertex AI, and train a model using AutoML. An experienced AI engineer can build their own training loops, train the model on the cloud, and deploy it using endpoints. Additionally, one can train the model locally and use Vertex AI for just deployment and monitoring. In essence, Vertex AI provides a unified interface for the entire AI/ML workflow (Figure E.19).

Figure E.19 Vertex AI, a unified interface for complete AI/ML workflow

Figure E.20 shows the Vertex AI dashboard; in the following subsections we will explore some of the important elements available in the dashboard.

Figure E.20 Vertex AI Dashboard

E.6.1 Datasets

Vertex AI supports four types of managed data: image, video, text, and tabular data. The table below lists the AI/ML tasks supported by Vertex-managed datasets.

Type of Data | Tasks Supported
Image        | Image classification (single label), Image classification (multi-label), Image object detection, Image segmentation
Video        | Video action recognition, Video classification, Video object tracking
Text         | Text classification (single label), Text classification (multi-label), Text entity extraction, Text sentiment analysis
Tabular      | Regression, Classification, Forecasting

For image, video, and text datasets, if you do not have labels, you can upload the files directly from your computer. If there is a file that contains the image URIs and their labels, you can import that file from your computer. Additionally, you can import data from Google Cloud Storage. Please remember that uploaded data will make use of Google Cloud Storage to store the files you upload from your computer. For tabular data, Vertex AI supports only CSV files; you can upload a CSV file from your computer or from Cloud Storage, or import a table or view from BigQuery.

Once the data is specified, Vertex AI allows you to browse and analyze the data. If the data is not labeled, you can browse the data in the browser itself and assign labels. Additionally, Vertex AI allows you to do the training-validation-test split either manually or automatically. Figure E.21 shows the analysis of the Titanic survival dataset using the Vertex AI managed datasets service. Vertex AI also provides the option of a Feature Store, which you can use to analyze the features of your data; it can help in mitigating training-serving skew by ensuring that the same feature data distribution is used for training and serving. Feature Store can also help in detecting model/data drift. And if one requires a data annotation service, that is also available via Vertex AI.

E.6.2 Training and Experiments

The Training tab lists all the training jobs you are running and have run in the Vertex AI platform. You can also use it to initiate a new training pipeline. The whole process is straightforward; just click on Create and follow the instructions on the screen. If you choose AutoML, then you get the option of choosing which features to use for training and which ones to ignore; you can also specify the transformations on the tabular data. One also has the option of selecting the objective function. After making all the selections, just decide the maximum budget you want to allocate for training (the minimum being 1 node hour) and start training. It is always better to use the early stopping option so that if there is no improvement in the model's performance, training stops (Figure E.22). Experiments let you track, visualize, and compare machine learning experiments and share them with others.

Figure E.21 Choosing training parameters and transformations

Figure E.22 Selecting budget and early stopping

For the purpose of demonstration, we used the HR analytics data[228] to predict whether a data scientist will go for a job change or not. We use all the columns except the enrollee ID for training the model. The data files contain a target column that tells whether a data scientist is looking for a job change or not.
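For readers who prefer code over the console, the same dataset creation and AutoML training can be sketched with the Vertex AI Python SDK (google-cloud-aiplatform). This is a hedged sketch: the project ID, region, Cloud Storage path, and display names are illustrative assumptions, and the target column follows the dataset described above.

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

# Create a managed tabular dataset from a CSV file in Cloud Storage.
dataset = aiplatform.TabularDataset.create(
    display_name='hr-analytics',
    gcs_source='gs://my-bucket/aug_train.csv')

# Configure an AutoML training job for binary classification.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name='hr-analytics-automl',
    optimization_prediction_type='classification')

# Train with a one-node-hour budget and early stopping enabled.
model = job.run(
    dataset=dataset,
    target_column='target',
    budget_milli_node_hours=1000,
    disable_early_stopping=False)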

E.6.3 Models and Endpoint

All the trained model details are provided in the Models tab. Each model entry includes the evaluation of the model on the test dataset (when trained using managed datasets). Not only this, but in the case of tabular data, one can also see the feature importance of all the input features. The information is available directly on the dashboard in both visual and text form. We had set a 1 node hour budget for model training; it took about 1 hour 35 minutes for the training to complete. Figure E.23 shows the evaluation of the model trained by AutoML on the test dataset, and Figure E.24 shows the associated confusion matrix.

Figure E.23 Model Evaluation on the test dataset

Figure E.24 Confusion matrix and feature importance

One can check the prediction of the model directly from the dashboard. To test the model, the model needs to be deployed to an endpoint. Vertex AI also gives the option to save the model (TensorFlow SavedModel format) in a container, which you can use to launch your model in any other service, on-prem or cloud. Let us choose to deploy the model by clicking on the Deploy to Endpoint button. To deploy, you will need to select the following options:

Give a name to the endpoint.
Choose the traffic split; for a single model it is 100%, but if you have more than one model you can split the traffic.
Choose the minimum number of compute nodes.
Select the machine type.
Select whether you require a model explainability option; for tabular data Vertex AI offers the sampled Shapley explainability method.
Choose whether you want to monitor the model for feature drift and training-serving skew, and set alert thresholds.

Once done, it takes a few minutes to deploy. Now we are ready to test the prediction (batch predictions are also supported). In Figure E.25, you can see that for the inputs selected, the data scientist is not looking for a job change, with a confidence level of 0.67. A sample request to the model can be made using the REST API or through a Python client; the Vertex AI endpoint page includes the necessary code for both kinds of sample requests.

Figure E.25 Model Prediction
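As a hedged illustration of the Python-client route, the sketch below calls a deployed endpoint with the Vertex AI SDK. The project, region, endpoint ID, and feature values are placeholders; a real request must supply the same feature columns the model was trained on.

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

# The numeric endpoint ID is a placeholder; copy it from the Endpoints tab.
endpoint = aiplatform.Endpoint('1234567890')

response = endpoint.predict(instances=[{
    'city': 'city_103',                 # illustrative feature values
    'city_development_index': '0.92',
    'education_level': 'Graduate',
    'experience': '>20',
}])
print(response.predictions)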

All the models in your project and the endpoints deployed in the project are listed in the Models and Endpoints tabs, respectively.

E.6.4 Workbench

Vertex AI provides JupyterLab and notebook support via the Workbench. The user has the option of managed notebooks or user-managed notebooks. The managed notebooks contain all the popular deep learning frameworks and modules, and you can also add your own Jupyter kernels using Docker images. The user-managed notebooks offer a wide range of base environments. Users have the option to choose vCPUs and GPUs while setting up the notebook. The managed notebooks are a good place to start using Vertex AI; the user-managed notebooks are good if you want better control over the environment. Once the notebook is created, click on the Open JupyterLab link to access your JupyterLab environment.

Figure E.26 Managed Notebooks in Google Cloud Console

The Vertex AI Workbench can be used to further explore the data, build and train a model, and run the code as part of a TFX or Kubeflow pipeline.

E.6.5 Vertex AI: final words

Vertex AI provides a single interface with all the components of the AI/ML workflow. You can set up pipelines to train the model and run many experiments, and the interface makes hyperparameter tuning easy. Users can opt for custom training, where they select a container and directly load their training code to run on their selected machine. To expedite the ML workflow, Vertex AI also has AutoML integration: for managed datasets, you can use the AutoML feature to get an efficient ML model with the least ML expertise. Vertex AI also offers model explainability using feature attribution. Lastly, with your model available, you can set up endpoints for batch or single predictions and deploy your model. The most important feature I found while deploying is that you can even deploy on edge devices - take your model to where the data is.

E.7 Summary

Present-day AI/ML workflows introduce technical debt, making it necessary to employ MLOps tools. GCP provides a variety of solutions for MLOps, namely Cloud Run, TensorFlow Extended, and Kubeflow. This chapter dives deep into Kubeflow, a cloud-native solution to orchestrate your ML workflow on Kubernetes. Kubeflow provides a curated set of compatible tools and artifacts that lie at the heart of running production-enabled AI/ML apps. It allows businesses to standardize on a common modeling infrastructure across the entire machine learning lifecycle. Vertex AI provides an integrated solution for the entire AI/ML workflow. The features of Vertex AI were demonstrated by training a model using AutoML on the HR analytics dataset.

E.7.1 References

1. Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. "Hidden technical debt in machine learning systems." In Advances in Neural Information Processing Systems, pp. 2503-2511. 2015.
2. Quiñonero-Candela, Joaquin, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
3. Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. "Machine learning: The high interest credit card of technical debt." (2014).
4. Zhou, Jinan, et al. "Katib: A distributed general automl platform on kubernetes." 2019 {USENIX} Conference on Operational Machine Learning (OpML 19). 2019.

[209] Adapted from [1]

[210] https://developers.google.com/machine-learning/guides/rules-of-ml/#training-serving_skew

[211] Image source: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

[212] https://www.forbes.com/sites/forbestechcouncil/2019/04/03/why-machine-learning-models-crash-and-burn-in-production/#64ca83e92f43

[213] https://thepoweroftwo.solutions/overview/

[214] https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/

[215] https://cloud.google.com/blog/products/serverless/cloud-run-gets-websockets-http-2-and-grpc-bidirectional-streams

[216] https://cloud.google.com/run/quotas (The option to increase up to 8 vCPUs was available as a preview at the time of writing this book.)

[217] TF Dev Summit 2019: TensorFlow Extended Overview and Pre-Training Workflow

[218] Image source: https://cloud.google.com/solutions/machine-learning/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build

[219] https://www.altexsoft.com/blog/datascience/the-best-machine-learning-tools-experts-top-picks/

[220] Kindly refer to the Kubeflow documentation for the latest installation instructions: https://www.kubeflow.org/docs/started/installing-kubeflow/

[221] https://github.com/kubernetes-sigs/kustomize

[222] Take note of the Kustomize version; Kubeflow is not compatible with later versions of Kustomize. To know the latest status, refer to this GitHub issue: https://github.com/kubeflow/manifests/issues/538

[223] https://github.com/kubeflow/gcp-blueprints/blob/master/kubeflow/README.md#gitopswork-in-progressusing-anthos-config-managment-to-install-and-manage-kubeflow

[224] Please remember that not all UIs work behind port-forwarding to the reverse proxy; it depends on how you have configured Kubeflow.

[225] This will grant access to the app and other HTTPS resources that use IAP.

[226] https://argoproj.github.io

[227] From the paper Zhou, Jinan, et al. "Katib: A distributed general automl platform on kubernetes." 2019 {USENIX} Conference on Operational Machine Learning (OpML 19). 2019.

[228] https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists?select=aug_train.csv

Appendix F Compute environment running on Windows

This chapter includes:

What Windows containers are and how they are different from Linux containers
How to run Windows containers on Anthos and GKE clusters
Unique considerations for storage and networking in Anthos and GKE Windows environments

F.1 Windows containers

To understand Windows containers, it's important to first build an understanding of container technology. Containers are a lightweight packaging and isolation mechanism for running applications. Instead of virtualizing the hardware stack as with the virtual machine approach, containers virtualize at the operating system level, with multiple containers running on top of the operating system (OS) kernel directly. In essence, a "container" is generally simply a process or set of processes being run on its host operating system with some key tooling in place to isolate that process and its dependencies from the rest of the environment. The goal is to make that running process safely isolated, while taking up minimal resources from the system to perform that isolation.

This poses a question: if containers are based on OS kernel-level capabilities, what does that mean for containers running on systems with totally separate kernels, as is the case with Linux compared to Windows? On Linux, the tooling used to isolate processes to create containers commonly boils down to cgroups and namespaces (among a few others), which are themselves tools built into the Linux kernel. Linux cgroups or "control groups" are used to manage resources such as memory and CPU, while namespaces provide a degree of isolation by separating processes into groups. For Windows containers, Microsoft implemented functionality native to the Windows kernel in order to create the native "Windows Server Container." (Microsoft also has a concept called a "Hyper-V container," which we will explore in greater detail in a few paragraphs.)

The benefits of containers, whether Linux or Windows varieties, include:

Increased portability and agility
Faster deployments and enhanced developer productivity
Improved scalability and reliability

Containers decouple the OS from the application dependencies in the code. Since the application code and the dependent libraries are packaged together into one unit, there are fewer version inconsistencies, and you won't have the problem of "It worked on my machine." A unique benefit of Windows containers is that they may help users save on licensing costs. As a proprietary operating system, Windows has licensing which must be accounted for when running a copy of the operating system. By using containers, applications can be isolated without the need to run a full copy of the operating system. By sharing a single underlying operating system, Windows container users can save on licensing costs while still packaging and isolating their applications. The architecture of VM-isolated applications is compared with that of container-isolated applications in Figure F.1.

Figure F.1 illustrates potential cost savings by bin packing multiple containers on a node.

F.2 Two modes of runtime isolation for Windows containers

Starting with Windows Server 2016, Microsoft began offering two modes of runtime isolation for Windows containers. The process isolation mode is most similar to Linux containers, and the Hyper-V isolation mode essentially runs the container using a very lightweight VM. Hyper-V isolated containers are meant to be used in a similar way to process isolated containers, but each container gets its kernel using a Hyper-V virtual machine rather than process isolation, unlike standard Linux containers. Process isolated containers are containers running on a host that share the same kernel with the host and are isolated from each other using resource control, namespaces, and other process isolation techniques. Of the two modes, this is the more similar implementation to native Linux containers. Windows Server Containers provide a process and resource isolation boundary and hence can be used for enterprise multi-tenancy. However, because Microsoft has stated they do not intend to service Windows container escape vulnerabilities, the use of process isolated containers is not recommended in hostile multi-tenancy scenarios or those where differing risk levels are needed. Kubernetes added support for Windows Server Containers running on Windows Server 2019 nodes in Kubernetes version 1.14, which launched in the Spring of 2019. Hyper-V isolated containers are isolated at the kernel level – each container gets its own kernel and runs inside a virtual machine (VM). This provides better security/isolation and compatibility between host and container operating system versions. They tend to have a heavier footprint, taking up more resources such as CPU and memory than process isolated containers (Windows Server Containers) due to the resource cost of the virtual machine/kernel level of isolation. The higher resource utilization and kernel level of isolation of Hyper-V containers make them least like native Linux containers. As of the time of writing, these containers are not officially supported by Kubernetes and thus not supported by GKE or Anthos. As Kubernetes does not support Hyper-V containers at this time, all discussions of Windows containers from this point on will refer to process isolated Windows Server Containers. We use the terms “Windows containers” and “Windows Server Containers” interchangeably in this text.

F.3 Using Windows containers This section outlines use cases and core concepts for using Windows containers, including workloads that are good candidates for Windows containers, an exploration of .NET vs. .NET Core applications, and details about Windows container licensing.

F.3.1 Good candidates for Windows workloads

Suitable applications for Windows containers include:

1. Existing .NET Framework 3.5+ based applications - N-Tier apps, WCF services, Windows services, etc.
2. All ASP.NET applications, including Web Forms, ASP.NET MVC, ASP.NET Web Pages, ASP.NET Web API, etc.
3. IIS 7+ applications, including ISAPI filters and other pipelining techniques.
4. Classic ASP applications.
5. Any console/batch application that does not have WinUI dependencies.
6. Applications refactored to the Cloud Native and Microservices-based architecture.
7. Newer applications* built on the .NET Core, .NET 5.0+, or ASP.NET Core.

*Newer applications built on the .NET Core, .NET 5.0+, or ASP.NET Core have cross-platform compatibility and can be deployed as Linux or Windows containers.

Applications that are not suitable for Windows containers:

1. UI applications with a visual user interface.
2. Applications that use Microsoft Distributed Transaction Coordinator (MSDTC).
3. Windows infrastructure roles (DNS, DHCP, DC, NTP, PRINT, File server, IAM, etc.) and Microsoft Office are not supported.

To see the latest set of limitations, see https://docs.microsoft.com/en-us/virtualization/windowscontainers/quick-start/lift-shift-to-containers#applications-not-supported-by-containers. Note that existing applications (including monolithic and N-Tier) don't need to be re-architected or rewritten to take advantage of the benefits of containerization and modernization. Multiple options are available depending on the use case, business needs, and the type of application, and some of the options don't involve costly rewrites or rearchitecting. The following section explains this in more detail.

F.3.2 .NET Core vs .NET Framework applications There are two popular options for modernizing an existing .NET Framework application – the application can be containerized into a Windows Server container and run on a Windows node, or the application can be ported to .NET Core/.NET 5+ and run as a Linux container on a Linux node. These two options for the process of containerizing Windows applications are illustrated in Figure F.2. Figure F.2 shows the branching processes one could take to containerize a .NET application. Either directly to Windows Server Containers running on a Windows Server operating system, or the application could be ported to .NET Core to be run in containers on a Linux operating system.

The first option (Windows Server containers on Windows) allows users to gain significant modernization benefits with a few deployment changes. It does not require changes to the code or architecture of the application and hence is an easier option. Users can containerize their application using Docker tools (e.g., Image2Docker), Visual Studio Docker extension, or manually creating a docker file and then using the CLI to generate the image. They can also use the Migrate for Anthos option to create the container images without touching the application code. With bin packing and better resource utilization brought by Windows containers, users get licensing and infrastructure savings in addition to the numerous benefits brought by containerization.

The second option (porting applications to .NET Core/.NET 5+ and using Linux containers) requires upfront investment, and applications need to be ported or re-architected/rewritten. However, after the initial toil, the applications run without any dependency on Windows Server licensing, resulting in 100% savings on licensing costs. Furthermore, Linux containers are lightweight and perform much better when compared to Windows containers. The choice between the two options depends on the user and their business goals and constraints. Suppose the user needs to run many replicas (a high scalability requirement) of a relatively small number of applications. In that case, it is recommended that they spend the extra effort to rewrite their Windows applications entirely to run on Linux or port them to .NET Core/.NET 5+. The heavy lifting upfront is a good trade-off for better performance, and it obviates the MS licensing dependency. On the contrary, porting to .NET Core or rewriting the apps to run on Linux can be cumbersome and painful if the user has several brownfield applications. Porting involves non-trivial effort and development investment, and it may not be possible to convert some applications because of dependencies. In such cases, running Windows Server containers on Windows hosts is a preferred and more accessible option - it provides significant benefits with a few lightweight deployment changes.

F.3.3 Container Licensing Currently, Kubernetes supports two Windows operating systems: Windows Server Standard and Windows Server Datacenter. These operating systems have no limit on the process-isolated Windows containers you may run. Windows Server Standard limits you to only two Hyper-V isolated containers, whereas Windows Server Datacenter allows unlimited Hyper-V isolated containers. At the time of writing, open-source Kubernetes and thus GKE only support process-isolated Windows containers. A Kubernetes cluster could, therefore, include nodes running any version of Windows Server Standard or Datacenter that supports process-isolated containers: Windows Server 2016,

Windows Server 2019, or, in the near future, Windows Server 2022. The Windows Server versions supported by GKE are covered in more detail in the "Which Windows Versions/Types are Supported" section. Be sure to check Microsoft's resources for the most up-to-date container licensing models: https://docs.microsoft.com/en-us/virtualization/windowscontainers/about/faq#how-are-containers-licensed-is-there-a-limit-to-the-number-of-containers-i-can-run-

F.3.4 Windows container base images When considering how to run your .NET applications in containers, your first step will be to select a base image to build your container from which is compatible with the application you intend to run. Microsoft offers four base images for users to build Windows containers from: Windows Server Core, Windows Nano Server, Windows, and Windows Server. The majority of use cases should be fulfilled by just two of these, Windows Server Core and Windows Nano Server. Additionally, it is possible to run .NET Core applications in Linux containers. Figure F.3 illustrates how to choose the container base image to best fit the needs of .NET applications. Figure F.3 illustrates the decision tree for determining which type of container base image should be used to run a .NET application.

Windows Nano Server base image - Modernized or newly developed .NET applications running on Windows may be able to make use of the Windows Nano Server container image. The Windows Nano Server image is designed to be lightweight and resource-efficient, offering a level of resource utilization most comparable to Linux container base images. The environment has fewer of the features you would find in a full Windows Server operating system, but enough to run many applications. This image takes up the smallest footprint of the four native Windows container base images.

Windows Server Core base image - For legacy .NET applications that may have previously run on Windows Server VMs, the Windows Server Core base image offers a more fully-featured Windows Server-like environment. This is the larger of the two most commonly used base images. Compared to all four base images, it is the second smallest. Windows base image - this is the largest of Microsoft’s currently offered container base images. It offers the full Windows API, giving you the ability to run a wide variety of workloads. However, your containers will also have the largest resource utilization compared with a container running the same application but built using one of the other base images. The image is available for Windows Server 2019 LTSC and SAC variants. Windows Server base image - this base image is similar to the previously mentioned “Windows base” image, but with none of the by design constraints as it’s being treated as a “Windows Server” like license. This base image is available starting with Windows Server 2022. .NET Apps in Linux Containers - One additional consideration is running .NET applications on Linux using .NET Core/.NET 5+. This allows you to avoid Windows licensing costs entirely but requires that applications be written or rewritten to use .NET Core/.NET 5+. In this case, the application would be able to run using Linux containers and thus could be built using any of the many available Linux container base images.

F.4 How Windows containers are different from Linux containers Windows containers and Linux containers are both packaging and isolation mechanisms which, in the case of Windows Server containers, run applications while sharing an underlying kernel. Since the implementations of these concepts are unique to the Windows and Linux kernels, it makes sense that there would be a number of differences between them. As you explore container resources, you will likely find that the majority of “container” information in the world today is geared toward Linux containers, and often does not explicitly call out this assumption. This section calls out significant differences between Linux and Windows containers

which should help you identify useful container resources and products for Windows use cases. Keep them in mind as you explore the world of process packaging and isolation with containers. Windows containers tend to require more resources. One of the most commonly encountered differences between native Windows and Linux containers is that Windows containers tend to take up considerably more resources than their Linux counterparts. A commonly cited benefit of container technology is that containers provide "lightweight," resource-efficient isolation. This is a relative judgment, which depends largely on the implementation of the container technology as well as the application you are trying to run. While the size of the application being run is one factor, another key factor in the larger average resource utilization of Windows containers compared with Linux containers can be largely attributed to the design of the two available Windows container base images, Windows Server Core and Windows Nano Server, discussed in the previous section. Windows containers have fewer available base images. As a proprietary technology, the base images available for Windows Server Containers are limited to the "flavors" provided by Microsoft and based on their Windows Server operating system. This is a difference from the plethora of base images available both from open source and proprietary sources for Linux containers. Windows container image pull times tend to be slower. Windows container base images are generally considerably larger than Linux container base images. Therefore, pulling Windows images from a repository will generally take longer due to the larger image size. Linux security features vs. Windows security features. Windows Server containers running in process-isolated mode are similar to Linux containers in terms of the level of isolation. Nonetheless, some security features specific to Linux systems and containers, such as Seccomp and SELinux, don't have a Windows analog. HostProcess containers for Windows are the equivalent of Linux privileged containers and provide privileged access to the host; they are disabled in the baseline policy. Note that HostProcess containers are not generally available at the time of writing (beta feature as of Kubernetes v1.23).

RunAsUserName, the equivalent of the Linux RunAsUser setting, can be used to run containers as a non-default user (a short sketch at the end of this section shows how to set this with the Kubernetes Python client). It can take values such as ContainerAdministrator and ContainerUser. ContainerAdministrator has elevated privileges and is used to perform administrative tasks, install services and tools, and make configuration changes. Following the principle of least privilege, running the container as ContainerUser is recommended unless admin access is needed. If the use case requires running hostile multi-tenant workloads, i.e., if at least one of the workloads is untrusted, running Windows Server containers with process isolation in the same cluster is not recommended. Also, there are some differences in how Linux handles file paths, file permissions, signals, identity, etc. compared with how Windows handles those aspects. For instance, Windows uses backslashes in file paths instead of forward slashes. It is good to be aware of such differences. See https://kubernetes.io/docs/concepts/windows/intro/ for more information.
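As a small, hedged sketch (using the Kubernetes Python client; the container name and image are illustrative), this is how the runAsUserName option mentioned above can be set so that a Windows container runs as the non-admin ContainerUser:

from kubernetes import client

# Container spec that runs as ContainerUser instead of ContainerAdministrator.
container = client.V1Container(
    name="app",
    image="mcr.microsoft.com/windows/servercore:ltsc2019",
    security_context=client.V1SecurityContext(
        windows_options=client.V1WindowsSecurityContextOptions(
            run_as_user_name="ContainerUser")))

The container spec can then be embedded in a pod template, as in the scheduling sketch in the next section.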

F.5 Windows containers on Anthos and Google Kubernetes Engine (GKE) clusters

This section explores the architecture and considerations of both Anthos and GKE clusters running Windows workloads. The two types of clusters have considerable overlap in terms of architecture, so the two are explained hand-in-hand. On-prem Anthos clusters, which have some unique considerations, are called out separately later in this section. Often the best way to learn is to try it out for yourself, so in this section you will find a hands-on tutorial that walks through creating a Windows container and deploying it onto Kubernetes via a GKE cluster. The process would be similar, though with slightly different tools, for Anthos clusters running in non-GCP environments.

F.5.1 Architecture of Anthos and Google Kubernetes Engine (GKE) clusters with Windows node pools When running containers at scale, a container orchestration tool is needed. By

far, the most popular container orchestration tool is Kubernetes. Anthos managed Kubernetes clusters can run in a variety of environments, and wherever they are run, they are primarily based on the architecture of GKE clusters. Be sure to check out the chapter Computing environment built on Kubernetes in this book for a more complete overview of Google Kubernetes Engine and how it relates to Anthos. This section provides a reference architecture for running Windows container-based applications via a GKE or Anthos cluster with Windows node pools. Figure F.4 Illustration of Windows Server and Linux containers running side-by-side in the same cluster

Figure F.4 illustrates the high-level architecture of an Anthos or GKE cluster with Windows node pools. The green block represents the managed master or control plane that continues to run using Linux, and the yellow blocks represent the worker node pools. Windows Server node pools can be attached to an existing or new GKE or Anthos cluster, just as you would add Linux node pools. The shared control plane for Linux and Windows is the key factor that provides a consistent experience. A single cluster can have multiple Windows Server node pools using different Windows Server versions, but each node pool can only use one Windows Server version. Labels are used to funnel Windows pods to Windows nodes, and taints and tolerations can be used to prevent or repel Linux pods from running on Windows nodes (a minimal scheduling sketch using the Kubernetes Python client follows the list of advantages below). Kubelet and kube-proxy run natively on Windows nodes. The overall architecture allows you to seamlessly run mixed Windows and Linux container workloads in the same cluster.

Advantages of running Windows workloads using Anthos

Modernization Benefits: Containerizing and running Windows workloads using Kubernetes allows users to take advantage of many of the modernization benefits previously limited to Linux-based applications. There are numerous advantages, from better scalability and portability to simplified management and speed of deployment.

Operational Consistency: By running Windows and Linux workloads side-by-side in a consistent manner, users get operational efficiency. You no longer need multiple teams specializing in different tooling or platforms for managing different types of applications - you can have consistent operations across Windows and Linux applications.

Cost Savings: GKE/Anthos Windows provides better bin packing - by running multiple applications as containers on a worker node, users get better resource utilization and hence infrastructure savings; more importantly, you also benefit from Windows Server license savings.
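As referenced above, here is a minimal, hedged sketch that uses the Kubernetes Python client to schedule a deployment onto Windows Server nodes. The IIS image, namespace, replica count, and the taint key/value in the toleration are illustrative assumptions; the toleration only matters if you have actually tainted your Windows nodes.

from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="iis-site",
    image="mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019",
    ports=[client.V1ContainerPort(container_port=80)])

pod_spec = client.V1PodSpec(
    containers=[container],
    # The node selector funnels the pods to Windows nodes...
    node_selector={"kubernetes.io/os": "windows"},
    # ...and this (hypothetical) toleration lets them onto nodes tainted for Windows-only workloads.
    tolerations=[client.V1Toleration(
        key="node.kubernetes.io/os", operator="Equal",
        value="windows", effect="NoSchedule")])

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="iis-windows"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "iis-windows"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "iis-windows"}),
            spec=pod_spec)))

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)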

Which Windows Versions/Types are Supported

At the time of writing, Windows node pools in GKE and Anthos clusters support the Windows Server 2019 LTSC (Long-Term Servicing Channel) version as the node OS image, and support for Windows Server 2022 is in the pipeline. Although support for Windows containers was added with Windows Server 2016, Kubernetes made support for Windows Server containers generally available in version 1.14 in March 2019 with Windows Server 2019; consequently, Windows Server 2016 is not supported. Note that GKE previously supported Semi-Annual Channel (SAC) versions of Windows Server as well; however, Microsoft deprecated the SAC channel starting with Windows Server 2022.

There are two options for the container runtime: Docker and containerd. The container runtime is used by Kubernetes nodes to launch and manage the containers that make up the Kubernetes pods. Containerd support was released by the Kubernetes open-source (OSS) community in 2018 and has been shown to decrease resource usage and improve startup latency. For GKE and Anthos clusters created in 2022 and later, it is strongly recommended that you use the containerd runtime (which is the default) if available.

Compatibility between the container base OS version and the host OS

For Windows Server containers deployed in process isolation mode, the operating system version of the container's base image must match the host's version (illustrated in Figure F.5). In the four-part version tuple major.minor.build.patch, the first three parts (i.e., the major, minor, and build versions) need to match; the patch versions need not match. This restriction does not apply to Windows Server containers running in Hyper-V isolation mode, where the versions can be different.

Figure F.5 Windows Container versions must share the third value of the container tuple. For example, a host running Windows Server version 10.0.16299.1 would be compatible with a container running version 10.0.16299.0, but the same container would not be compatible with a host running version 10.0.17134.0.
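The compatibility rule can be illustrated with a tiny Python check that mirrors the example in Figure F.5: the major, minor, and build fields must match, while the patch field may differ.

def process_isolation_compatible(host_version: str, container_version: str) -> bool:
    # Compare only the major.minor.build fields; the patch field is ignored.
    return host_version.split(".")[:3] == container_version.split(".")[:3]

print(process_isolation_compatible("10.0.16299.1", "10.0.16299.0"))  # True: only the patch differs
print(process_isolation_compatible("10.0.16299.1", "10.0.17134.0"))  # False: the build differs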

Prerequisites

1. Even for a dedicated Windows cluster, at least one Linux node is required in the GKE/Anthos cluster, because some system pods only run on Linux nodes.
2. Larger machine types are recommended for running Windows Server containers: n1-standard-2 is the minimum recommended machine type on GCP, as Windows Server nodes require additional resources. The smaller machine types f1-micro and g1-small are not supported.
3. Licensing:

a. In the GCP cloud environment, the license is baked into the VM image. When a user adds a Windows VM to their GKE or Anthos cluster, the corresponding license is added with it, just as when customers provision a Windows VM in GCE. Check Google Cloud’s compute disk image pricing documentation for the most up-to-date information and details.

b. In the Anthos on-premises environment, users must procure their own Windows Server licenses (the BYOL, or bring-your-own-license, model). Consequently, users need to download a Windows Server ISO from Microsoft or use their company-curated OS image per Microsoft licensing terms; a vanilla OS image from Microsoft is recommended. Users create a base Windows VM template from the Windows Server ISO, which is used when a node pool is added to the user cluster.

Running Windows Containers on Google Cloud and GKE (Tutorial)

This section provides a hands-on tutorial that creates a Windows container and runs it on GKE. It is adapted from the “Running Windows containers on Google Cloud” codelabs available at:

Part 1: https://codelabs.developers.google.com/codelabs/cloud-windowscontainers-computeengine
Part 2: https://codelabs.developers.google.com/codelabs/cloud-windowscontainers-kubernetesengine

Setup and Requirements

This tutorial expects you to have a Google Cloud Platform project to work within. You can use an existing one or create a new one.

Part 1: Creating a Windows Container

Create a Windows VM

To create a Windows container, you need access to a Windows operating system that supports containers. Begin by creating a Windows Server instance in Google Cloud on which you will build your Windows container. You can do this either with the gcloud CLI or in the console; in this section, we provide instructions for the Google Cloud Console.

In the Google Cloud Platform Console, go to the Compute Engine section and create an instance. For this exercise, select Windows Server 2019 Datacenter for Containers as the version for your boot disk, as seen in Figure F.6.

Figure F.6 The “Boot disk” section of the VM creation workflow on Google Cloud Platform (GCP) has an “OS images” tab where you can select the desired operating system image for your VM. For this tutorial, we use “Windows Server 2019 Datacenter for Containers.”

Also make sure that HTTP and HTTPS traffic are allowed for this VM, as shown in Figure F.7.

Figure F.7 The “Firewall” section of the VM creation workflow offers two checkboxes that must be checked for this tutorial: “Allow HTTP traffic” and “Allow HTTPS traffic.”

Select “Allow full access to all Cloud APIs,” as shown in Figure F.8.

Figure F.8 In the “Access scopes” subsection of the “Identity and API access” section of the VM creation workflow, select “Allow full access to all Cloud APIs.”

After you click “Create,” it takes a few minutes for the VM to start up. Once the VM has started, you should see it running in the console, as shown in Figure F.9.

Figure F.9 A running VM instance in Google Cloud displays the VM name, with additional information such as region and IP shown inline to the right of the name.
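If you prefer the gcloud CLI mentioned earlier, a command along the following lines creates a comparable VM. This is a sketch rather than part of the codelab; in particular, the VM name is a placeholder and the image family name is an assumption that you should verify against the current Compute Engine image list.

# Sketch only: creates a Windows Server 2019 "for Containers" VM.
# The image family below is an assumption; confirm it with:
#   gcloud compute images list --project windows-cloud --no-standard-images
gcloud compute instances create windows-docker-vm \
    --zone=us-central1-a \
    --machine-type=n1-standard-2 \
    --image-project=windows-cloud \
    --image-family=windows-2019-for-containers \
    --tags=http-server,https-server \
    --scopes=cloud-platform

The http-server and https-server network tags correspond to the default firewall rules that the “Allow HTTP traffic” and “Allow HTTPS traffic” checkboxes create in the console workflow.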

Remote Desktop (RDP) into the Windows VM

To access the VM you created, you will need to configure a password. There are two ways to do this. You can set a new Windows password using the RDP drop-down menu, as seen in Figure F.10.

Figure F.10 Clicking the “RDP” drop-down menu shown in the VM information from the previous image gives you the option to set the Windows password.

Alternatively, you can click “View gcloud command to reset password” and run the displayed command to set the password. After a few seconds, you should see the Windows password in the console or Cloud Shell. Make a secure note of it; you will need it to access the Windows VM. To log in to the Windows VM, click the RDP button of the VM shown in the last two images, this time clicking “RDP” itself rather than the drop-down symbol. You may also use your own RDP client to log in to the VM if you prefer.
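For reference, the gcloud command behind that console option takes roughly this form; the VM name, zone, and username here are placeholders for your own values:

# Generates a new password for the given Windows user on the VM and prints it.
gcloud compute reset-windows-password windows-docker-vm \
    --zone=us-central1-a \
    --user=your-username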

Once inside the VM, open a command prompt in admin mode. Docker and the Windows Server Core base image are installed by default, which you can confirm by running “docker images” as shown in Figure F.11.

Figure F.11 Running “docker images” in the command prompt of the created Windows VM will provide output showing that the Windows Server Core base image is already installed.

Create an application to containerize

For the app inside the Windows container, we will use an IIS web server. IIS has an image for Windows Server 2019. Using the image as-is would serve the default IIS page; let’s configure IIS to serve a custom page instead. Create a folder called my-windows-app with the following folder and file structure:

C:\my-windows-app>dir /s /b
C:\my-windows-app\content
C:\my-windows-app\Dockerfile
C:\my-windows-app\content\index.html

Replace index.html with the following content:

<html>
  <head>
    <title>Windows containers</title>
  </head>
  <body>
    <p>Windows containers are cool!</p>
  </body>
</html>

This is the page IIS will serve.

Build the container image using Docker

To containerize this IIS web server application using Docker, you will need to create a Dockerfile. A Dockerfile consists of several key instructions that tell Docker how to build a container image to run your application. Create a file called “Dockerfile,” with no file extension, consisting of the following lines:

FROM mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019
RUN powershell -NoProfile -Command Remove-Item -Recurse C:\inetpub\wwwroot\*
WORKDIR /inetpub/wwwroot
COPY content/ .

Build the Docker image and tag it with the Google Container Registry (GCR) prefix and your project id; this will be useful when we push the image to GCR later. Run the following command, replacing [project id] with your project id:

docker build -t gcr.io/[project id]/iis-site-windows .

Once the Docker image build has completed, you can see the image along with its IIS dependency by running “docker images.”

Run the Windows container

Before running the container, you might need to open port 80 on the VM instance. In the command prompt inside the Windows VM, run the following:

C:\>netsh advfirewall firewall add rule name="TCP Port 80" dir=in action=allow protocol=TCP localport=80
C:\>netsh advfirewall firewall add rule name="TCP Port 80" dir=out action=allow protocol=TCP localport=80

You should now be ready to run the Windows container. To run the container and expose it on port 80, execute the following command, replacing [project id] with your project id:

docker run -d -p 80:80 gcr.io/[project id]/iis-site-windows

Confirm that the container is running with the command docker ps. To see the web page, copy the address from the External IP column of the Compute Engine instance and open it over HTTP in a browser. It should look similar to Figure F.12.

Figure F.12 Entering the External IP of your workload into a browser should open a website proclaiming “Windows containers are cool!”

You are now running an IIS site inside a Windows container! Note that this setup is not ideal for production: it does not survive server restarts or crashes. In a production system, you would get a static IP for your VM and add a startup script to start the container. That takes care of server restarts but does not help much with server crashes. To make the app resilient against server crashes, you can run the container inside a pod managed by Kubernetes, which is what you will do in Part 2.

Part 2: Running a Windows Container on GKE

Push the container image to Google Container Registry (GCR)

To make the container image you created in Part 1 accessible to GKE, you need to host it in an image repository. We will use Google Container Registry for this purpose. To push a container image from a Windows VM to GCR, you need to:

1. Make sure that the Container Registry API is enabled in your project.
2. Configure Docker to point to GCR.

First, make sure the Container Registry API is enabled by running the following gcloud command in the command prompt with administrator privileges in your Windows VM:

gcloud services enable containerregistry.googleapis.com

Configure Docker to point to GCR by running:

gcloud auth configure-docker

When you are asked if you wish to continue, enter “Y.” To push the image to GCR, run the following docker command, replacing [project id] with your project id:

docker push gcr.io/[project id]/iis-site-windows

If you go to the GCR section in the Cloud Console, you should see the image, as shown in Figure F.13.

Figure F.13 Navigating to the Images section of your container registry in Google Cloud should display information about the iis-site-windows image.

Create a Kubernetes cluster with Windows nodes

For this exercise, you will create a zonal GKE cluster. This cluster only has nodes within a single zone, as opposed to a regional GKE cluster, which has multiple control planes and nodes in multiple zones. Before creating a GKE cluster, make sure the project id is set to your project and compute/zone is set to the zone you want (replace [project id] with your project id and [preferred zone] with your preferred zone):

gcloud config set project [project id]
export ZONE=[preferred zone]
gcloud config set compute/zone ${ZONE}

Creating a Kubernetes cluster in GKE with Windows nodes happens in two steps:

1. Create a GKE cluster with IP aliasing and at least one Linux node. At least one Linux node is needed before Windows nodes can be added to the cluster.
2. Add a Windows node pool to the GKE cluster.

Use the following command to create a GKE cluster, replacing [cluster name] with your preferred name for the cluster:

export CLUSTER_NAME=[cluster name]
gcloud container clusters create ${CLUSTER_NAME} \
  --enable-ip-alias \
  --num-nodes=2

Once your GKE cluster is up, you can add a node pool consisting of Windows nodes to it:

gcloud container node-pools create windows-node-pool \
  --cluster=${CLUSTER_NAME} \
  --image-type=WINDOWS_LTSC \
  --no-enable-autoupgrade \
  --machine-type=n1-standard-2

Notice that we are disabling automatic node upgrades. Windows container versions need to be compatible with the node OS version, so to avoid unexpected workload disruption, it is recommended that users disable node auto-upgrade for Windows node pools.
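When you are ready to move to a newer node version (after confirming that your container images match the new Windows build), you can upgrade the node pool manually. A sketch of that operation against the node pool created above:

# Manually upgrade only the Windows node pool once image/OS compatibility is confirmed.
gcloud container clusters upgrade ${CLUSTER_NAME} \
  --node-pool=windows-node-pool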

For Windows Server containers in GKE, you are already licensed for the underlying Windows host VMs; the containers need no additional licensing.

Configure kubectl for your GKE cluster

You will use the Kubernetes command-line tool, kubectl, to interact with your cluster. To configure kubectl, run the command:

gcloud container clusters get-credentials ${CLUSTER_NAME}

Before using the cluster, wait for several seconds until windows.config.common-webhooks.networking.gke.io is created. This webhook adds scheduling tolerations to Pods created with the kubernetes.io/os: windows (or beta.kubernetes.io/os: windows) node selector to ensure they are allowed to run on Windows Server nodes. It also validates the Pod to ensure that it only uses features supported on Windows. You can confirm that the webhooks were created by outputting webhook configuration information using the following command:

kubectl get mutatingwebhookconfigurations

Run your Windows container as a pod in GKE

Kubernetes takes a declarative approach to running containers: a user defines their desired state in YAML files, which Kubernetes uses to create and maintain objects that match the desired state. To run your IIS container as a pod in Kubernetes, create a file called iis-site-windows.yaml consisting of the following lines. Be sure to replace ${PROJECT_ID} with your project id.

iis-site-windows.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iis-site-windows
  labels:
    app: iis-site-windows
spec:
  replicas: 2
  selector:
    matchLabels:
      app: iis-site-windows
  template:
    metadata:
      labels:
        app: iis-site-windows
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: iis-site-windows
        image: gcr.io/${PROJECT_ID}/iis-site-windows
        ports:
        - containerPort: 80

Note that this YAML creates two replica pods running the image you published earlier to GCR. It also ensures that your pods run on Windows nodes by using the nodeSelector field. To create the deployment defined in the iis-site-windows.yaml file, run:

kubectl apply -f iis-site-windows.yaml

After a few minutes, you should see the deployment created and pods running. Run the following command to confirm:

kubectl get deployment,pods

You should see output similar to Figure F.14.

Figure F.14 The “kubectl get” command is used to display high-level information about Kubernetes objects. In this case, deployments and pods in the default namespace are shown.

Create a Kubernetes Service

Your application is now running in Kubernetes. To confirm the IIS site is up, you need to create a Kubernetes Service that makes the pods accessible to the outside world. Run the following kubectl command to create a Service of type LoadBalancer that can be used to reach your IIS web server:

kubectl expose deployment iis-site-windows --type="LoadBalancer"
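If you prefer to stay fully declarative, the same Service can be expressed as a manifest and applied with kubectl apply. This sketch is equivalent in intent to the expose command above; the port numbers mirror the container port used in the Deployment:

apiVersion: v1
kind: Service
metadata:
  name: iis-site-windows
spec:
  type: LoadBalancer
  selector:
    app: iis-site-windows      # matches the labels in the Deployment
  ports:
  - port: 80                   # external port on the load balancer
    targetPort: 80             # container port serving the IIS site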

You can confirm the service is up by running:

kubectl get service

Note that it will take a few minutes for the service to be up and running. Once the EXTERNAL-IP column of the “kubectl get service” output has been populated, you can visit that IP to see the page IIS is serving. The website should look the same as before, despite being reached on a different IP, as shown in Figure F.15.

Figure F.15 A web page proclaiming “Windows containers are cool!” should be displayed when you enter the External IP of your LoadBalancer-type Kubernetes Service.

This web page should be the same as the one you saw in Part 1. The big difference now is that Kubernetes is managing the application. If something goes wrong with the pod running your application, or with the node on which the pod is running, Kubernetes recreates and reschedules the pod for you, which is great for resiliency.

Clean Up

Once you have finished exploring your containerized application running on GKE, it is a good idea to delete the resources you used in this tutorial to avoid incurring additional costs from idle resources. To clean up your environment, first delete the Service and the Deployment; this also automatically deletes the external load balancer that was created for the Service:

kubectl delete service,deployment iis-site-windows

Then, delete your GKE cluster:

gcloud container clusters delete ${CLUSTER_NAME}

You should also delete the VM you used to create your Windows container. This can be done either in the console or via the command line. To delete the VM via the Google Cloud Console, go to the Compute Engine VM instances page and select Delete from the menu for the VM you want to delete, as shown in Figure F.16.

Figure F.16 Clicking the dots at the rightmost end of the line of VM information gives you the option to delete that VM via the Google Cloud Console.

Running Windows Containers on Anthos On-Premises

The user experience of running Windows containers is fairly consistent across Anthos and GKE. However, some additional setup and installation requirements are specific to Anthos clusters in on-prem environments, and this section highlights those requirements. A general on-prem Anthos architecture is shown in Figure F.17.

Figure F.17 Illustration of Windows Server and Linux containers running side-by-side in the same Anthos on-prem VMware cluster

The differences in on-prem environments stem primarily from two factors. First, Anthos users must procure their own licenses (BYOL) in the on-prem environment. In a BYOL model, Google cannot ship the Windows Server OS image with the Anthos bits. Consequently, users have to take the additional steps of downloading the OS ISO image from Microsoft and creating a VM template for node-pool creation. Also, when Microsoft releases security patches for the OS image, Google tests and qualifies the latest security patch version and publishes the results; users then need to build a new VM template with the security patch and perform a rolling update on their Windows node pools.

Second, in the Anthos on VMware environment, admin clusters and the underlying VMware constructs come into play, so it is essential to ensure those prerequisites are met. Windows nodes are supported only as user cluster worker nodes; the control plane nodes and admin cluster nodes continue to be Linux-based. Specifically, in addition to the prerequisites listed above, users need to ensure the following:

1. An admin cluster is in place before you create a Windows node pool, because a Windows node pool is supported only in a user cluster.
2. The vSphere environment is vSphere 6.7 Update 3 or later.
3. A user cluster with Windows node pools must have the enableDataplaneV2 field set to true in the user cluster configuration file, which enables Dataplane V2 on the Linux nodes in that cluster. Additionally, if you want Windows Dataplane V2 enabled for the Windows node pools in the user cluster, the configuration file must also have the enableWindowsDataplaneV2 field set to true (see the sketch after the steps below).

High-level steps for adding a Windows node pool to an Anthos on-prem cluster include:

1. Create the Windows VM template for Anthos clusters on VMware.
2. Upload Windows container images to a private registry, if you use one.
3. If your cluster is behind a proxy server, allowlist the required URLs on your proxy server.
4. Add a Windows node pool to the user cluster configuration file.
5. Finally, create the Windows node pool.

Once this is complete, you can deploy your Windows containers just as you would deploy to GKE. For details, check out the step-by-step instructions in the Google Cloud documentation page "User guide for Windows Server OS node pools".
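As referenced in step 3 above, a fragment of the user cluster configuration file might look like the following. This is a hedged sketch, not the full file: the enableDataplaneV2 and enableWindowsDataplaneV2 fields come from the text above, while the node pool fields (osImage pointing at the Windows VM template, osImageType: windows, and the sizing values) are illustrative assumptions that should be checked against the Anthos clusters on VMware configuration reference for your version.

# Excerpt of a user cluster configuration file (sketch; verify field names
# against the Anthos clusters on VMware documentation for your version).
enableDataplaneV2: true          # Dataplane V2 on the Linux nodes
enableWindowsDataplaneV2: true   # Windows Dataplane V2 for Windows node pools
nodePools:
- name: windows-node-pool
  cpus: 4                        # illustrative sizing values
  memoryMB: 8192
  replicas: 3
  osImage: windows-2019-template # name of the Windows VM template you created
  osImageType: windows           # marks this pool as a Windows Server pool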

F.5.2 Unique storage, networking, and identity considerations for Anthos and Google Kubernetes Engine (GKE) Windows environments

Storage and networking are key to running workloads in distributed computing environments such as Kubernetes clusters. Kubernetes is well known for its capabilities with stateless workloads, but it can be a great home for stateful workloads too. Networking containers together, both within a single machine and across a cluster of machines, as Kubernetes does, involves layers of abstraction from the hardware up through various levels of software. This section explores unique considerations for storage and networking common to both Anthos and GKE Windows environments.

F.5.3 Storage

In a nutshell, the way Kubernetes manages storage on Windows is similar to Linux. The volume plugins, both in-tree and CSI (mentioned in the appendix Anthos, Data and Analytics), provide an abstraction layer for storage vendors to support provisioning, attaching, and mounting storage for containers. End users interact with Persistent Volumes (PVs), Persistent Volume Claims (PVCs), and StorageClass objects the same way as on Linux.

Windows Storage Limitations

However, Windows storage support has some limitations compared to Linux at the time of writing. Some are listed in the following (check the documentation for more details):

- The Docker runtime can only support volume mounts targeting a directory, not a file. The containerd runtime no longer has this limitation and is strongly recommended instead of the Docker runtime.
- Memory as the storage medium for ephemeral storage is not supported. As a result, if you define an emptyDir volume, you cannot set its emptyDir.medium to Memory.
- Raw block volumes are not supported (Windows cannot attach raw block devices to pods).
- mountPropagation with Bidirectional is not supported for volume mounts.
- Read-only volumes are supported for Windows, but not read-only filesystems.
- User masks and permissions are not supported for the volume; permissions are instead resolved at the container.

Windows CSI support

In the past, storage providers had to create plugins to enable their storage through Kubernetes. Those storage plugins were treated as “in-tree” components, meaning they had to become part of the core Kubernetes code to be used. This model was problematic because it opened Kubernetes’ core code up to bloat and made creating and maintaining plugins challenging for storage providers. Due to its flaws and limitations, the plugin system was replaced in Kubernetes version 1.13 when the Container Storage Interface (CSI) became generally available (GA). Since then, storage providers have created CSI-compatible plugins that allow their users to consume storage via Kubernetes.

Windows CSI support was introduced in Kubernetes version 1.19. However, the initial versions did not support HostProcess containers (the equivalent of privileged containers on Linux), which CSI node plugins require in order to be deployed as containers. HostProcess container support is available in Alpha as of Kubernetes 1.22, and the Beta is slated to launch with Kubernetes 1.23. An interim solution is the open source software (OSS) project CSI Proxy. CSI Proxy provides a mechanism to deploy the node plugins as unprivileged pods and use the proxy to perform privileged storage operations on the node. The CSI Proxy API includes the Disk, Filesystem, SMB, and Volume API groups, which graduated to v1 in 2021. It also has iSCSI and system APIs, which are still Alpha at the time of writing.

Anthos aims to provide a consistent experience across a variety of environments; however, not all environments move forward at the same rate. Be sure to check the Anthos documentation (https://cloud.google.com/anthos/docs/concepts/overview) for the types of storage drivers available in each environment.

Windows Google Cloud Storage Drivers in Action

The following shows an example of using a StorageClass, a PV/PVC, and a pod to access persistent storage in Anthos on Google Cloud.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-gce-pd-windows
parameters:
  type: pd-balanced
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: podpvc-csi
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: csi-gce-pd-windows
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server
spec:
  tolerations:
  - operator: Exists
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: web-server
    imagePullPolicy: Always
    image: k8s.gcr.io/e2e-test-images/agnhost:2.32
    volumeMounts:
    - mountPath: /www/html
      name: mypvc
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: podpvc-csi
      readOnly: false

Data sharing between Linux and Windows using SMB

There are several scenarios in which data sharing is required between pods running on Linux nodes and pods running on Windows nodes, for example:

- An app is being migrated from the .NET Framework on Windows to .NET Core on Linux; some parts have been migrated, and other parts of the app still run in Windows containers.
- Logs written by Windows pods need to be captured and shipped to a common search and analytics engine, and the log shipper runs on Linux.
- A legacy .NET app running in Windows Server containers is getting new features, the new features are developed as separate microservices running on Linux using .NET Core, and the old and new parts of the app work against the same database.

You can use the Server Message Block (SMB) protocol in such scenarios. SMB is one of the most popular network file-sharing protocols for Windows users. The CSI Windows community has released an OSS SMB CSI driver that supports both Linux and Windows clusters. We recommend using this driver to access SMB volumes, either on a self-managed Samba server or on a cloud volume service (CVS), in GKE. The Google Cloud documentation contains an example of how to use the open source SMB CSI driver for Kubernetes to access a NetApp Cloud Volumes Service SMB volume on a Google Kubernetes Engine (GKE) cluster with Windows Server nodes.
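To make the SMB approach concrete, the following is a hedged sketch of a StorageClass and PVC for the open source SMB CSI driver (driver name smb.csi.k8s.io). The share path, secret name, and sizes are placeholders, and the exact parameter set should be checked against the csi-driver-smb documentation for the version you install.

# Sketch only: a StorageClass backed by the OSS SMB CSI driver and a PVC
# that both Linux and Windows pods in the cluster could mount.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: smb-share
provisioner: smb.csi.k8s.io
parameters:
  source: //smb-server.example.internal/shared           # placeholder SMB share
  csi.storage.k8s.io/node-stage-secret-name: smb-creds    # secret holding username/password
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: smb-shared-data
spec:
  accessModes:
  - ReadWriteMany            # the same share can be mounted by Linux and Windows pods
  storageClassName: smb-share
  resources:
    requests:
      storage: 10Gi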

F.5.4 Networking

Like Linux networking, Windows networking relies on the Container Network Interface (CNI) to connect Kubernetes pods into the cluster network.

There are two modes of Windows network implementation for GKE and Anthos:

1. The traditional Dataplane for Windows, which is based on win-bridge (the L2bridge networking mode)
2. The newer Dataplane V2 for Windows, which is based on Open vSwitch (OVS) and Antrea

Traditional Dataplane for Windows

In GKE and Anthos on-prem (on VMware), the win-bridge network mode is used to connect pods to the underlying network; it leverages the Hyper-V virtual switch (vSwitch). There is a small difference between GKE and Anthos on-prem (VMware) for cross-node connectivity. In a GKE VPC-native cluster, pods use real VPC IP addresses from a secondary IP address range, so pod IPs are natively routable inside the VPC and pod traffic across different nodes is natively supported. In Anthos on-prem (on VMware), there is no VPC; Flannel is used to exchange routes between Windows nodes, and Dataplane V2 on the Linux nodes is required to exchange routes between Linux and Windows nodes.

GKE/Anthos Dataplane V2 for Windows

The newer Dataplane V2 is a more programmable dataplane that is optimized for Kubernetes networking and can perform Kubernetes-aware packet manipulation without sacrificing performance. While Dataplane V2 for Linux is based on eBPF (extended Berkeley Packet Filter) and Cilium, the Windows solution is based on Open vSwitch and Antrea (an OSS project that implements a CNI). Dataplane V2 helps overcome some of the limitations of the traditional implementation and enables features such as network policy. It adds new components to the cluster: antrea-agent and antrea-controller. The antrea-agent runs on every node and programs the Open vSwitch datapath, while the antrea-controller runs as one instance per cluster, watches the Kubernetes API server for updates, and exposes an API for the antrea-agent to query per-node policy rules.

GKE Dataplane V2 provides a consistent user experience for networking across all Anthos and GKE environments, provides real-time visibility of network activity, and has a simpler architecture that makes troubleshooting easier. Furthermore, most new features will only be supported in Dataplane V2, so the use of Dataplane V2 for your container networking is strongly recommended. See the chapter on the networking environment for more details about networking.

To create a new cluster with GKE Dataplane V2, use the following command:

gcloud container clusters create CLUSTER_NAME \
  --enable-dataplane-v2 \
  --enable-ip-alias \
  --release-channel CHANNEL_NAME \
  --region COMPUTE_REGION
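Because Dataplane V2 enables Kubernetes network policy, you can express rules such as "only allow HTTP traffic to the Windows IIS pods from within the namespace." The following is a minimal illustrative NetworkPolicy; the pod labels reuse the iis-site-windows app label from the tutorial, and the policy itself is a sketch rather than a recommended production policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-http-to-iis
spec:
  podSelector:
    matchLabels:
      app: iis-site-windows     # applies to the Windows IIS pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}           # any pod in the same namespace
    ports:
    - protocol: TCP
      port: 80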

F.5.5 Active Directory Integration

Active Directory (AD) integration is the most frequently used mechanism for authentication and authorization in Windows-based networks. Active Directory keeps track of objects in the network in a data store and makes it easy for authorized users to discover and access resources. Windows containers cannot be domain-joined directly; nonetheless, you can configure a group Managed Service Account (gMSA) so that applications running in Windows containers can use AD functionality. Anthos and GKE support group Managed Service Accounts: you can configure gMSA credential specs as custom resources in your cluster, and Windows pods can then be configured to use a gMSA to access and interact with other resources and services. Check the documentation (https://cloud.google.com/kubernetes-engine/docs/how-to/creating-a-cluster-windows#using_gmsa) and tutorial (https://cloud.google.com/architecture/deploying-aspnet-with-windows-authentication-in-gke-windows-containers) for more details.
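As an illustration of how a pod opts into a gMSA, the Kubernetes pod spec exposes a Windows-specific security context field for this purpose. The sketch below assumes a GMSACredentialSpec custom resource named webapp-gmsa has already been created and authorized in the cluster; that name, and the pod name, are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: aspnet-with-gmsa                   # placeholder name
spec:
  securityContext:
    windowsOptions:
      gmsaCredentialSpecName: webapp-gmsa  # placeholder GMSACredentialSpec resource
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: app
    image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019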

F.6 Summary

- Windows containers, like Linux containers, come with several benefits: better agility, improved scalability and reliability, and cost savings achieved through bin packing, to name a few.
- There are two modes of runtime isolation for Windows containers: process isolation and Hyper-V isolation. Process isolation mode is more similar to native Linux containers, while Hyper-V isolation provides more security and isolation and is better suited for hostile multi-tenancy scenarios.
- There are two popular options for modernizing an existing .NET Framework application: the application can be containerized into a Windows Server container and run on a Windows node, or it can be ported to .NET Core/.NET 5+ and run as a Linux container on a Linux node.
- Anthos/GKE allows you to run your Linux and Windows containers side-by-side in the same cluster, resulting in a consistent experience and operational efficiency, among several other benefits.
- For Windows Server containers deployed in process isolation mode, the operating system version of the container's base image must match the host's version.
- The experience of running Windows containers in different Anthos environments is fairly consistent. However, there are some differences to consider, such as the OS licensing model for the Windows nodes.
- The way Kubernetes manages storage and networking on Windows is similar to Linux. The volume plugins (in-tree and CSI) provide an abstraction layer, and end users interact with Persistent Volumes (PVs), Persistent Volume Claims (PVCs), and StorageClass objects the same way as on Linux.
- The traditional Dataplane for Windows is based on win-bridge, and the newer Dataplane V2 for Windows is based on Open vSwitch (OVS) and Antrea.
- Active Directory (AD) integration is the most common approach to authentication and authorization for Windows-based networks. AD gMSAs are supported by Anthos and GKE.