Successful Management of Cloud Computing and DevOps
Successful Management of Cloud Computing and DevOps
Alka Jarvis, Prakash Anand, Johnson Jose
Milwaukee, WI
Successful Management of Cloud Computing and DevOps
Alka Jarvis, Prakash Anand, Johnson Jose
American Society for Quality, Quality Press, Milwaukee, WI, 53203

All rights reserved. Published 2022
© 2022 by Jarvis, Anand, Jose

No part of this book may be reproduced in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

Publisher's Cataloging-in-Publication Data
Names: Jarvis, Alka S., author. | Anand, Prakash, author. | Jose, Johnson, author.
Title: Successful management of cloud computing and DevOps / by Alka Jarvis, Prakash Anand, Johnson Jose.
Description: Includes bibliographical references. | Milwaukee, WI: Quality Press, 2022.
Identifiers: LCCN: 2021951464 | 978-1-63694-009-0 (paperback) | 978-1-63694-010-6 (epub)
Subjects: LCSH Cloud computing. | Web services. | Database management. | Software engineering. | Computer software—Development. | Agile software development. | BISAC TECHNOLOGY & ENGINEERING / Data Transmission Systems / General | BUSINESS & ECONOMICS / Information Management | COMPUTERS / Data Science / Data Warehousing | COMPUTERS / Database Administration & Management
Classification: LCC QA76.585 .J37 2022 | DDC 004.6782—dc23
Library of Congress Control Number: 2021951464

ASQ advances individual, organizational, and community excellence worldwide through learning, quality improvement, and knowledge exchange.

Bookstores, wholesalers, schools, libraries, businesses, and organizations: Quality Press books are available at quantity discounts for bulk purchases for business, trade, or educational uses. For more information, please contact Quality Press at 800-248-1946 or [email protected]. To place orders or browse the selection of all Quality Press titles, visit our website at: http://www.asq.org/quality-press.

Printed in the United States of America
26 25 24 23 22 GP 7 6 5 4 3 2 1

Quality Press
600 N. Plankinton Ave.
Milwaukee, WI 53203-2914
E-mail: [email protected]
Excellence Through Quality™
Contents
Preface
Acknowledgments
Acronyms

Part I: Process

Chapter 1: Business Models
    Core Process Digitization Model
    Planning for Digitization
    Agile Business Model
    Lean Model
    Lean Software Development (LSD)
    Software Development Life Cycle (SDLC)
    Imperative Business Strategies

Chapter 2: Data Center
    Traditional Data Centers
    What Is a Data Center?
    The Role of a Data Center
    What Goes into a Data Center?
    On-Site Data Center
    Choosing a Site for the Data Center
    The Benefits and Limitations of a Traditional Data Center

Chapter 3: Overview of the Cloud
    Origins of Cloud Computing
    Characteristics of Cloud Computing
    Cloud Computing—the Benefits and Customer Expectations
    Types of Cloud Computing
    When Should Cloud Computing Be Used?
    CapEx or OpEx?
    Cloud Strategy
    Cloud Sourcing and Procurement
    TCO for Cloud Computing
    Common Cloud Misconceptions
    State of the Cloud

Part II: Technology

Chapter 4: Cloud Operations
    How Does the Cloud Work?
    ITOps, DevOps, SRE, and the Cloud
    Organization Structure and the Cloud
    Why Manage the Cloud?
    Introduction to Cloud Management
    Cloud Usage and Cost Management
    Cloud Migration
    Cloud Security
    Cloud Marketplaces

Chapter 5: Cloud Metrics and Monitoring
    Metrics
    Cloud versus On-Prem—When, What, and Where to Monitor
    Common Monitoring Framework
    Cloud Service Metrics
    Cloud Security and Security Metrics
    Finance Metrics
    Commonly Used Business Valuation Metrics
    Metrics Relevant to the Database
    Cloud Monitoring
    Cloud Monitoring Solutions
    Splunk

Chapter 6: Optimized DevOps
    Monolithic Applications
    The Process Layer—Waterfall, Agile, and CI/CD
    How Does the Cloud Enable Us to Transform DevOps?
    The New Application Architecture
    Automated Cloud Deployment and Building Blocks
    Application Deployment Frequency
    Deployment Strategies

Part III: Commercial Attributes

Chapter 7: Conventional Cloud Services
    Cloud Computer Services
    Cloud Storage
    Cloud Database Service
    AI/ML, Blockchain, IoT, and the Cloud
    TaaS – Test as a Service
    Business Continuity, Disaster Recovery, and Backup—Cloud Options
    Backup

Chapter 8: Cloud and Quality Management
    Essential Characteristics of Cloud Quality
    Cloud Quality Management
    Automated End-to-End Tests in Production
    Alerting
    Cloud Standards
    Cloud Quality Assessment Models

Chapter 9: Cloud Practitioner Best Practices
    Summary of Cloud DevSecOps
    Common Challenges of Cloud DevOps
    A Cloud and DevOps Checklist
    Hybrid Cloud and Multicloud—Choice or Necessity?
    Suggestions for Advancement of the Cloud
    AI/ML
    The Future of Cloud Computing: The Next Journey

Index
Preface
Cloud and cloud computing probably are the most trending buzzwords in the technology world today. We will first talk about the cloud, and then cloud computing. Cloud, a term that was once used mainly by technology geeks, today has become a household word. Almost everyone with computing access has benefited from cloud-enabled ecosystems in one way or another. The importance of the cloud infrastructure gained greater prominence at the height of the COVID-19 pandemic. Many countries were quickly able to shift all essential facilities, such as education and healthcare, to a virtual mode due to efficient frameworks put in place by cloud technologies. Even though the cloud has come into the limelight in more recent years, the principles of distributed computing are an age-old topic. People were unknowingly envisioning a cloud infrastructure even as early as the 1950s, which is almost the same time that computers were invented.

As quality professionals, we have been working in many of Silicon Valley's premier technology companies for more than two decades. During this time, we all have served in various leadership roles and capacities, including in the areas of software engineering, customer assurance, data center management, and cloud transformation. In addition, we have worked on the development of some of the early phases of technologies such as IPv6, high availability, Internet Protocol (IP) tunneling, blade servers, and distributed computing—all of which had a profound impact on the development of cloud computing. However, working on these technologies has made us aware of the deep-rooted complacency around the cloud and the widespread ignorance of its fundamentals. That lack of knowledge often translates into a direct loss of time and resources for the company. Although most cloud users claim to understand terms like Platform as a Service (PaaS), Infrastructure as a Service (IaaS), Network as a Service (NaaS), and Software as a Service (SaaS), we have realized that this is not the case. Our experience has shown that most designers and decision-makers do not
have the full picture of what they are getting into while attempting to use cloud infrastructure. Through this book we attempt to build an end-to-end decision-making guide for cloud transformation. Cloud computing is not a new technology; rather, it is a combination of existing technologies and utilities bundled to make their access and use flexible and easy. Knowing these technologies and how they shape the cloud is essential for all practitioners of the cloud, from engineers to top-level decision-makers.

The cloud has evolved through several iterations, including some early failures. The first major contribution to the cloud infrastructure came about due to innovation in the area of distributed computing and peripherals. What we know today as the cloud engine or cloud machines and the now popular cloud storage are all successors of these advancements. The second contribution was developments made in virtualization and high availability. By virtualizing the hardware, the major obstacles of location dependency and physical constraints were removed, which greatly helped the spread of the cloud. The third was the evolution of Web 2.0, application programming interfaces (APIs), and microservice-based programming. By overcoming the limitations of monolithic applications and waterfall programming, application development suddenly shifted from local development to a style of development where people can contribute from anywhere in the world. Another development, service-oriented architecture and asynchronous transactions with representational state transfer (REST)-based APIs, paved the way for serverless computing. Advancements in utility computing and edge computing added several opportunities to the cloud, including the Internet of Things (IoT). With unlimited expansion potential, areas like artificial intelligence/machine learning (AI/ML) started finding new ways to grow in the cloud. We discuss these transformations throughout the book so the reader can connect to the "cloud story" and gain a greater sense of appreciation for these technologies.
How to Use This Book

This is perhaps the most comprehensive book you will find on the topics of cloud computing and development. The text provides the in-depth knowledge needed to understand the cloud infrastructure and associated details, which can enable readers to implement and transition to the cloud to address problems and successfully deploy techniques to improve the customer experience. The focus of this book is not on a specific cloud provider, but rather on the underlying technologies that make the cloud what it is today.
We have organized the book into three sections: process, technology, and commercial attributes. The decision to move to the cloud is driven by factors other than technology; the process and total cost of ownership (TCO) are important considerations as well. The first three chapters of the book cover the journey of the technology from the old-fashioned, on-prem data center to modern-world cloud offerings by commercial entities known as cloud service providers (CSPs). We have taken a balanced approach, using examples from the three major CSPs: Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure. While there are several published training courses and tutorials available, we hope that readers will benefit from our experience in converting the theories into real-life scenarios.

Chapter 4 provides an overview of cloud operations. Chapter 5 takes a closer look at cloud monitoring and metrics. One of the biggest concerns of the virtual ecosphere is the absence of physical proximity to the devices and network. That invisible domain often creates a sense of panic in the minds of network designers and troubleshooters who were trained by the visible indications of the red, green, and orange lights on the back panel. Therefore, some of the primary offerings from CSPs are related to the detailed and transparent monitoring and measurement of everything going on in the cloud.

The realm of programming and applications is heavily dependent on the underlying hardware, network, and peripherals that connect them all together. Traditionally, these applications were designed to operate on a fully loaded server that contains all of the resources required for it to run smoothly. We guide the reader through a journey from the monolithic world all the way to the latest microservice and serverless world. We are sure that the knowledge gained from this journey will have far-reaching impacts on the mindset of a cloud practitioner. There are many hybrid and intercloud deployments, and it is important to know how various providers use these hybrids to build up their offerings. Chapters 6 and 7 discuss various cloud providers and how to build and publish an application on the cloud. We also present a view on how cloud providers compare with on-prem data centers and even how the offerings of these providers themselves compare.

Parts of Chapters 7 and 8 provide an overall view of how an application can be built from the bottom up and used in all aspects of the cloud. Building applications also involves utilities like IoT and AI/ML, as well as how they fit into the cloud paradigm. Chapter 9 includes best practices for cloud practitioners. As you consider the path forward to full cloud implementation, refer to this chapter for practical guidance.
For a cloud enthusiast, it is important to learn how peripheral utilities and various applications benefit from the cloud. We have also touched on the importance of quality standards in the cloud. Getting certified in the use of a specific CSP is not an indication of a true cloud designer; rather, it raises awareness of the mode of packaging for that provider. What really matters is the technology that the CSP employs. Buzzwords like virtual private cloud (VPC), virtual machine, high availability, encryption, packet tunneling, and pub/sub already exist in the noncloud environment, but knowing why and how to use them effectively in the cloud is what brings real success. We hope the book takes readers on a journey that continues beyond these pages, and we are confident that the knowledge they gain will prepare them for the next step in the evolution of the cloud.
Acknowledgments
We have been fortunate to have several subject matter specialists on the cloud who dedicated their time and expertise and have conducted industry research to make the contents of this book robust. We would like to recognize the following people:
• Deepak Bhaskaran, for his extensive contributions in Chapters 2 and 8
• Rony Joseph, Viral Patel, Swaraj Vinjapuri, Thobias Vakayil, Manish Mehrotra, Animesh Sahay, Maggie Xu, Roshni Francis, and Shobha Deepthi, for their advice and feedback on Chapters 5, 6, 7, and 9
These individuals provided case studies, detailed walkthroughs, and technology references that are used in cloud transformation. We also gained practical input from the aforementioned specialists and other industry experts regarding artificial intelligence/machine learning, Internet of Things, and blockchain. This book would not have been possible without the leadership and ongoing support of our publisher, Quality Press, and the Quality Press team.
Acronyms
• AI—Artificial intelligence
• AI/ML—Artificial intelligence/machine learning
• APM—Application performance monitoring
• AWS—Amazon Web Services
• BPM—Business process management
• BU—Business unit
• CASE—Computer-aided software engineering
• CBA—Cost-benefit analysis
• CLV—Customer lifetime value
• COGS—Cost of goods sold
• COQ—Cost of quality
• CR—Continuous regression
• CSA—Cloud Security Alliance
• CUD—Committed use discount
• CVE—Common vulnerabilities and exposure
• DDoS—Distributed denial of services
• DevOps—Development and operations
• DevSecOps—Development, security, and operations
• DNS—Domain name system
• EA—Enterprise agreement
• EBS—Elastic block storage
• EC2—Elastic compute cloud
• EUM—End user monitoring
• E2E—End to end
• FaaS—Function as a service
• GCP—Google Cloud Platform
• GCE—Google Compute Engine
• GCS—Google Cloud Storage
• GPU—Graphics processing unit
• HTTP/HTTPS—Hypertext transfer protocol/Hypertext transfer protocol secure
• IaaS—Infrastructure as a service
• I/O—Input/output
• IP—Internet protocol
• ISP—Internet service provider
• IT—Information technology
• ITOA—IT operation analytics
• ITOps—Information technology operations
• JAD—Joint application development
• KMS—Knowledge management system
• KPI—Key performance indicator
• LAN/WAN—Local area networking/wide area networking
• LSD—Lean software development
• MFA—Multifactor authentication
• MTBF—Mean time between failures
• MTTR—Mean time to repair
• MVP—Minimum viable product
• NFR—Non-functional requirements
• NPOS—Net promoter score
• OU—Organizational unit
• PaaS—Platform as a service
• PoS—Point of service
• PPA—Private pricing agreement
• RAD—Rapid application development
• RAM—Random-access memory
• RDS—Remote desktop services/relational database services
• REST—Representational state transfer
• RI—Reserved instance
• SAN—Storage area network
• SaaS—Software as a service
• SDLC—Software development life cycle
• SDN—Software-defined networking
• SMB—Small and mid-size business
• SOA—Service-oriented architecture
• SRE—Site reliability engineering
• SSO—Single sign-on
• TaaS—Test as a service
• VM—Virtual machine
• VPC—Virtual private cloud
• WAF—Web application firewall
• XaaS—X as a service
Part I Process
CHAPTER 1
Business Models
Look around you, and you'll see change everywhere. Roads are wider, office buildings are taller, cars are fancier, customers are more demanding, and the list goes on. In this environment, businesses are trying their best to compete in a global market, as the world is getting much smaller thanks to constantly changing technologies. The internet has put just about everything we need at our fingertips, making it easier to compare products, prices, delivery times, and warranties. Today's consumers are becoming savvier with their demands, which are often carefully planned and well thought-out. There is nothing constant in life, so how can we expect our customers and their ongoing needs to be constant? Customers really don't care where a product was manufactured or the types of processes that were used to make it, as long as the product works as promised and advertised. And as long as the product is received on time, happy customers are willing to continue their relationship with the firm, giving it repeat business in the future.

The international and domestic competitive trade environment, with its constant pursuit of efficiency and agility, can limit flexibility—unless you pay close attention to various business models and select one, or a combination of them, to capture market share in order to raise your company's brand awareness. There are several business models in the market, such as:
• Adaptability
• Differentiated process
• Drift management
• Democratic management
• Smart market penetration
• Market penetration pricing
However, the core process digitization model is most relevant for the topics that we are discussing in this book.
Core Process Digitization Model

Digital technologies are used to alter the way that you operate a business model by creating innovative openings that result in additional revenue streams and value-producing prospects. Agility and responsiveness are essential to companies, now more than ever, as they encounter competition domestically and internationally in price wars. By digitizing your core processes, you are moving to a digital business that allows you to gain competitive advantage by addressing projects faster and better, beating the competition by delivering the product to market quickly. Information-intensive, digitized business processes can lead to an increase in profit as a result.

Manual document control is time-consuming and expensive, and it limits the progress of workflows. At the same time, manually controlling revisions of documents can be cumbersome and may lead to challenges in process adherence, as well as in meeting customer requirements. The time it takes to find the latest paper copies of documents inhibits creativity. Companies that are conscious about their quality management systems constantly ensure that their employees have the latest policies or procedures to do their jobs. They automate routine analog processes for consistency and cost savings, as well as efficiency. There are various digitization solutions, including companies that offer document-scanning services to organize design documents, contracts, service-level agreements (SLAs), and other elements into any digital format, such as .txt, .jpeg, .gif, .docx, or .xls. The techniques of document imaging and scanning help reduce filing time and the time spent locating documents that have been filed incorrectly. The biggest advantage is the savings in physical storage space. We all have photos of family members that are now faded or precious letters that we have saved for years that are now hard to read as the ink has become lighter. Digital documents retain their original color and allow valuable documents to be preserved.

The importance of digitizing core processes is often overlooked. However, these core processes may take substantial time and prevent you from becoming a market leader. For example, collecting data on your products from your customers and compiling customer feedback can take a long time. By the time this data is collected and analyzed and a decision is made, your customers may have moved on to another company or their needs may have changed. Therefore, nowadays companies are combining artificial
intelligence (AI) and digitalization to obtain higher-quality customer insights and business results. Core process digitization differs from company to company. Some companies have detailed processes with inputs, outputs, records, and policies, while others have only an overall concept. The entire list of possibilities can be exhaustive. Regardless of the number, however, you need to consider the entire ecosystem that makes up your business and includes various critical infrastructures. For each one, you also need to identify individual workflow steps and data collection needs. Here is a list of seven business substructures that are good candidates for digitization:
• Customers: Customer relationship management (CRM); service-level agreements (SLAs); capturing customer-found defects; customers' buying patterns
• Products: Development (includes descriptions, designs, test plans, validation and verification, delivery plans, sales, and warranties); end-of-life process
• Manufacturing: Inventory reports; bill of materials; production orders; architectural drawings; designs and tests; on-time delivery reports
• Suppliers: SLAs; supplier quality reports; supplier defect reporting; supplier audit reports
• Employees: Knowledge management systems (KMSs); hiring; onboarding; payroll; performance evaluation; vacation; termination
• Finance: Operation life cycles, including profits and liabilities
• Compliance: Regulatory requisites, identifying each entity and associated requirements; compliance processes with test results
Planning for Digitization

Know that once you have digitized various workflows within different infrastructures, the job is not yet done. Success is achieved when you have examined how the digitized workflows are linked together and interact with one another as a whole, enterprise-wide digitized system. Important considerations include the manual documents you may have, how you will create and edit them, and how you will make digitization efficient enough to increase productivity and provide a positive customer experience. Due to the need to interlink various processes, consider including staff from cross-functional departments to provide feedback about end-to-end customer experiences. This may lead to establishing a team of subject matter
experts from each department. You may also want to ask data scientists to give critical input on the actionable data to be collected for positive user experiences, and customer experience experts to include vital insights on repeat customer business and customer satisfaction.

For customer-related digitization, you may want to review feedback from your customers. Many companies collect this information via surveys, while some also use regular customer meetings and other avenues such as customer conferences and customer councils. While developing workflows for digitization, including some of these opinions and feedback will increase customer satisfaction and customer loyalty. Many companies have come to rely on smartphone technology, which offers accessibility to their customers. Mobile apps offer customers easy access to products, sales staff, and customer support via smartphone. Mobile interactions offer a smooth interface and give customers a seamless experience, making it convenient for them to interact with the company to make purchases or seek technical help.

Enterprise-wide digitization of workflows is a cost-prohibitive activity in many organizations. Therefore, you need to develop a well-thought-out plan considering the financial impact and then set aside funds for digitization. The new automated workflows will affect the current information technology (IT) environment. To avoid such a scenario and obtain results more efficiently, many companies use cost-effective methods of utilizing offshore expertise by outsourcing workflow digitization efforts. In today's economy, profit margins are razor thin, and for companies to stay ahead of the pack, their business digitization plans should include cloud-based technologies. Later in this book, we provide readers with complete details on these technologies, including operations, metrics, monitoring, deployment, and common standards.
Agile Business Model

If you remain static in the business environment of today, any commerce transformations will leave you scrambling to catch up. Economic uncertainties, volatile conditions, the ever-changing pace of business, and constant game-changing technologies can hit you like an avalanche. Bear in mind that the demands of stakeholders such as customers, partners, and state and country regulators are rapidly evolving, too. Each stakeholder is looking for different benefits. For example, if you are working in a public company, then your investors are continuously looking for growth and high return on their investment. The Agile model is a paradigm shift and a lightweight approach to deal with uncertainties by companies changing from traditional
management to management with Agile practices and business development. The combination of Agile practices and business development allows you to react faster to unpredictable market or global vicissitudes by helping you to create economic advantages and maximize performance. An entire company's people-centered culture allows employees to take charge and readily adapt to any fluctuations by letting go of old habits and embracing new strategies and associated policies. They design and develop foundational elements that slowly evolve and support additional dynamic capabilities that can adjust to new challenges and opportunities. The elimination of nonfunctional and nonvalue-add habits will promote efficiency and yield better throughput. The philosophy is to operate in fast decision cycles and embrace rapid learning enabled by technology. Since employees know the processes, they are best positioned to contribute to altering them to achieve common value for stakeholders.

It is often said that being Agile is not about process, but about people. To learn quickly and shed old habits requires humble and open-minded individuals who are willing to identify their mistakes and design solutions that address them. They also need to have the courage to explore new ideas and persevere until their efforts prove to be successful. The Agile model emphasizes both speed and flexibility—the speed with which you adopt a change and the flexibility to embrace it and improve the company's "muscle memory" when it comes to adaptation. Employees should be equipped with the right tools and training, which answers for them the age-old question, "What's in it for me?" Once they understand the benefits of a change, and that they will be able to accomplish their job more easily, they will be more receptive to using the new tools. Ongoing internal communication by the leaders will help build cultural alignment and reinforce the reason behind the implementation.

As the Agile model adoption proceeds, management and leaders need to know how the new ways will affect the overall organization, and plans must be developed up-front to incrementally make changes throughout, ultimately revolutionizing the way of working. Success will depend on whether the leaders of your company plan the entire Agile conversion in detail with a strategic approach. The first and foremost step is to identify the reasons for the transition and to have answers for basic questions such as whether they are adopting this model to meet customer demands, or to beat the competition, or to release products faster in the market. Once these questions are answered, the leaders need to communicate them to the employees to get them on board with the idea. If everyone is not on the same page, Agile transformation is likely to fail, and spending ample time on planning and strategic thinking will save you hours of grief later.
Lean Model

The lean model presents a strategy to eliminate or reduce waste in processes and products, while at the same time satisfying customer requirements and needs. The main principles of lean are:
• The minimization (or even elimination) of waste
• Products built with quality in mind
• Focus on overall improvement
• Creation of knowledge
• Emphasis on fast delivery

The lean principles originally were emphasized in manufacturing to optimize the production line to reduce waste. The idea is to maximize customer goodwill by developing high-quality products faster; adhering to core processes and bringing products to market more cheaply by being conscious of the associated expenses is key. Lean is a performance-based methodology designed to promote efforts, individuals, and workflows that are centered on customer demands by being mindful of waste from unnecessary and cumbersome activities. Together, augmentation in all three areas will result in better value for customers. The guiding principles are to respect individuals and focus on continual process improvements in which every employee is actively involved.
Lean Software Development (LSD)

Lean software development (LSD) focuses on applying lean principles to software development methodology to maximize the efficiency of the design and writing of code, reduce the occurrence of defects, and help eliminate flaws throughout the development life cycle. The technique involves getting information directly from the customer, thus eliminating any potential errors generally captured by traditional specifications and eliminating steps that result in wasted effort. It does not allow ideas or projects to be started without understanding the customers and their requirements first. Granted, every team wants to develop high-quality products. However, this is easier said than done, and even with good intentions, many teams end up creating code that has far too many defects, and in some cases, has to be completely rewritten. The defects have to be fixed and retested, wasting both time and effort. Improving overall efficiency is a mindset that forces teams to do strategic planning in incorporating steps such as reviewing code, writing detailed test plans, and ensuring that the product is built right the first time rather than spending additional hours on extra validation and testing. The emphasis of the lean approach is to document learning with the purpose of using it for future projects and creating knowledge for the future.
Rather than overengineering, the concept of lean emphasizes delivering a fast solution and upgrading it incrementally with feedback from customers. In some ways, LSD is similar to Agile development, where short iterations provide the opportunity to build products faster with collaborative teamwork and open communication, allowing teams to fix any potential defects quickly so products can be released in the shortest amount of time. This also has side advantages of amplifying learning from past development mistakes and correcting them in subsequent iterations. The principle of late decision-making in LSD is critical, and it allows changes to occur right up to the last minute, so it is cost effective to make changes to a product. Any new potential feature does not have to wait for the next version of the product, which may be five to six months away (or even later). In LSD, since changes can be incorporated immediately, the feature can be added right away. The automation of processes really pays off positively in LSD, as it eliminates manual intervention and promotes consistency of product quality, including the use of computer-aided design and testing of the code.
Software Development Life Cycle (SDLC)

Much has been written regarding various software development life cycles, also known as application development life cycles. It is said that planning how to do projects is as critical as doing them. A software development life cycle (SDLC) identifies the activities that need to be carried out and outlines the associated deliverables. There are several benefits to having an SDLC. It:
• Provides consistency among all the software projects that are developed by defining the activities and deliverables of each phase
• Reduces development time since the activities, related templates, and approval processes are already defined
• Minimizes inefficiencies and reduces the overall overhead and other costs
• Assists in creating a product that meets customer requirements
• Helps the business to outshine its competitors, which may be overburdened with needing more time or cost to do the same type of work
• Makes it easier to plan, schedule, and estimate a project
• Facilitates the approval process since the stakeholders may already be defined
• Increases the quality of the deliverable, as the process and content are predefined
• Reduces risks in the event that there is a deviation with the process and approvals are required
• Helps with process changes and user experience management, along with adherence to company policies
Waterfall, parallel, spiral, rapid, and other less common SDLCs dominated software development for a long time. As customers and their needs evolved, new development methodologies started bombarding the software industry. Let’s briefly highlight these more traditional models as a refresher.
Waterfall

The waterfall model is the most traditional model, in which each phase must be completed before the next phase can begin and there is no overlapping. It was the first process to be introduced, and it served the needs of customers for years. Project activities are organized into linear, sequential phases, such as the following:
• Market analysis
• Planning
• Requirements
• Design
• Coding and performing unit tests
• Performing system tests
• Performing user acceptance tests
• Product releases
• Maintenance

Each phase of the project proceeds in a downward flow, like a waterfall (hence the name). Each phase must be fully completed before the subsequent phase can begin. The output of one phase becomes input for the next phase. Generally, there are entrance criteria and exit criteria with each phase as a quality control procedure to ensure that all required deliverables, along with proper approvals, are present before the team moves to the next phase. Reviews/approvals of the requirement document, design, coding, test plan, and test results play a vital part in ensuring the needed quality throughout the project.
Parallel

The software development process is linear when one version is completed and the next version is derived from the original, with upgrades or fixes. The linear process makes configuration management easy with each new version. There are cases where, after the start of the project, new functions or additional requirements from customers are added, which can cause delays.
The problem of delay between the requirements phase and final delivery of the product is addressed by the parallel development methodology. Design and implementation are not done in sequence; instead, when there is a need, a general design is completed for the entire system, which is then divided into subprojects, each of which has a separate development path that diverges from the common general design. In this case, additional new configurations, called variants, are designed and implemented in parallel. Instead of following a linear path where coding is addressed chronologically, various features of the code are worked on simultaneously by different teams. When all the subprojects are completed, they are merged into a whole system, which then is tested and released to customers. In a parallel life cycle, the phases of the development overlap and there are more handovers of interim deliverables and more communication within the team prior to completion of the project. Proper branching and strategic planning in the merging of variants and unit and system testing help to make the product robust. Having an appropriate design is definitely needed for success with the branching model.
Spiral

The spiral model is a combination of the waterfall model with prototypes; it is a risk-driven, incremental development methodology. The adoption of multiple process models is facilitated according to the associated risks: waterfall provides the sequential development, and prototyping provides the iterations (spirals) that are developed until the entire system is built. The spiral model has four phases: planning and requirements, designing, coding, and testing. The phases are repeated for every iteration (spiral), offering the opportunity to learn from one spiral to the next, minimizing the risks of development, and avoiding costly errors. The initial spiral starts with a small number of requirements that go through the abovementioned four phases. In each subsequent spiral, more functionalities are added until the entire system is developed. There are several advantages to following the spiral method. It is useful when the project is large or the requirements are unclear, are complex, or change frequently. Small incremental prototypes make it easier to manage quality and conduct risk analysis.
Rapid Application Development (RAD)

Rapid application development (RAD) is similar to Agile development, where rapid prototype releases are prioritized and worked on. This methodology encourages progressive development, speed, low cost, and quality.
It facilitates the use of computer tools such as computer-aided software engineering (CASE) and joint application development (JAD) to analyze, design, and produce programs quickly from design specifications. In RAD, various functions of the system are developed in parallel as miniprojects, and incremental developments are delivered according to speedy feedback and a preestablished schedule, and then gathered into a working prototype. The model has the flexibility to allow you to easily envision a final solution and develop multiple iterations of the software quickly, and it eliminates the need to start developing from scratch each time.
Imperative Business Strategies

Developing business strategies is an absolute necessity in order for you to examine how your business is performing and to understand its strengths and opportunities. The key strategies for business include time-to-market, inventory control, meeting customer expectations, keeping up with market expectations (such as monitoring the stock market), total cost of ownership (TCO), acquisitions, mergers, business diversification, and workplace mobility, all of which help an organization to sustain itself. Like cloud computing or e-commerce, workplace mobility has a huge positive impact on your business. It can assist in creating an innovative business framework and give your company a competitive advantage. One of the biggest changes currently is for organizations to determine whether they want to have their own data centers or have their database systems managed by a third-party cloud service provider. This topic is discussed in Chapter 2.
CHAPTER 2
Data Center
Enterprises run computing resources using different models. There are three broadly accepted paradigms, with some variance among them. In this chapter, when we use the term traditional paradigm, we are referring to the following:

• Data centers
• Colocation
• Managed hosting

Traditional Data Centers

The term data center brings to mind endless hallways with tall stacks of blinking servers piled up neatly to the ceiling. You show identification at the front desk, swipe your access card at the elevator, and reach the floor on which your corporate data center is hosted. Swipe your card again at the main door to the server room, and you can go in now; you see row after row of blinking blue lights from stacks of servers; you are surrounded by the crown jewels of the company you work for. You are in the data center!
What Is a Data Center?

A data center is a physical facility where companies store their data and run their applications. It is the central location from which your organization's computing power is delivered to customers and employees. This structure could be as big or as small as the needs of the organization that it supports. Also, the facility could comprise real or virtualized hardware or, more recently, cloud-based hardware resources. In 1946, one of the earliest examples of a facility similar to the data center was built in the United States, for the Electronic Numerical Integrator and Computer (ENIAC). The ENIAC was just one computer, but it was huge.
It used about 18,000 vacuum tubes, weighed 30 tons, was roughly 2,352 square feet in size, and used 150 kW of energy. The facility built to house the ENIAC had all the elements of a data center:
• Monitoring of the vacuum tubes, and procedures established to replace them when they burned out
• Systems to ensure that the power supply was sufficient and reliable
• Temperature management solutions to keep the ENIAC cooled
Things have changed somewhat since that time, but not completely!
The Role of a Data Center

We often hear that businesses are going digital or undergoing digitization. In Chapter 1, we talked about core process digitization. In a pre-cloud world, the traditional data center was the venue that offered digitized services. In the old days, the human resources (HR) department maintained all employee files on paper. That system was hard to manage and hard to search through, and it was difficult to prove that the department was complying with regulations. If all employee data were to be digitized, they would be stored in the company's data center and accessed by users through applications run from there. The HR staff could access employee information from their desks, without needing to go to a basement storage space and walk through aisle after aisle of alphabetically arranged files to find what they wanted.
What Goes into a Data Center?

Data centers have servers that process data, databases for data storage, networking devices that make connectivity available, and security devices that keep everything secure. The specific needs of the business determine the size and composition of the data center. A good place to start is the amount of computing capacity needed, the number of applications or services that are going to be hosted, the number of users accessing them, and their usage patterns. If the business is a huge, nationwide behemoth with thousands of employees and applications, its data center should be resourced and set up accordingly. Another consideration is the nature of the computing capacity needed. It is good to know about the applications that are to be hosted in the data center and the nature of the computing that they will perform. If the applications require lots of data to be written and read, or they require many mathematical calculations to be performed, the data center is designed accordingly. It also helps to know where the people using the service are going to be located. This is key when determining the right site for your data center.
Knowing the business context of the services that are being offered will also help. If these services are meant to operate only from 9 A.M. to 5 P.M., for example, the resourcing needs are going to differ compared to services that operate 24 hours a day, 7 days a week. The security needs of the business and the regulations it must comply with are important to consider as well. The higher the security needs, the higher the maintenance costs of operating the services. Likewise, if the services are operating in a highly regulated industry, they may cost more than in a lightly regulated industry. Understanding the financial constraints is important to building a data center that matches the needs and constraints of the business that it is going to support. Another key question to ask is whether the business is better served financially by building a data center from scratch or by buying an existing data center.
On-Site Data Center

In the on-site data center model, an enterprise runs all its computing resources on site (i.e., within an office building) or leases or purchases a separate data center. An on-site data center usually has more space, cooling, and power than an enterprise needs, as it is built to deal with peak demand. In this model, the enterprise is responsible for sourcing, procuring, designing, building, and operating everything from the data center to the applications. The management of a data center also requires on-site personnel and frequent repairs, replacements, and upgrades, all of which are expenses that add to the ongoing capital expenditures. There are two fundamental challenges with data centers, as described below.
Capacity Planning

The needs of the business, current and future, play a big part in capacity planning for the data center. The core needs and future growth prospects of the business are key inputs to the capacity planning process. It is important to start with a good understanding of these needs and then periodically reevaluate and adapt accordingly. The needs of a large, IT-centric business are different from those of a large, non-IT-centric business. Similarly, the needs of a small but rapidly growing firm are different from those of a small company whose growth prospects are uncertain. A small firm whose growth is certain may want to leave room to accommodate future growth. If those growth plans are not certain, you may not want to spend too much time elaborating on the scalability of the data center space.
Another thing to keep in mind is the needs of the business and the seasonality of its fluctuations. Take into consideration the following points:
• The number of users expected to use the service on a normal day
• The magnitude of their service requests in terms of depth and speed
• Without any external events, how high the number of service requests can go
• What happens during special events, peak hours, or the holiday season
If the business is planning to do a huge marketing push that might drive a lot of user traffic to its website, that needs to be factored into capacity planning to avoid any service disruptions. One example of such a disruption is the J.Crew website, which crashed on Cyber Monday 2017 when severely heavy traffic from a major sale event overwhelmed the site. On the other hand, if you are constantly running out of capacity, you are forced to make additional expenditures. In either case, you will be spending a lot of capital. Few enterprises can size their data centers to exactly meet their capacity demands, so they almost always end up with excess capacity.
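As a rough, back-of-the-envelope illustration of how the considerations above turn into a sizing number, the following Python sketch estimates how many servers an on-site facility would need to provision for a peak event. Every figure in it (user counts, requests per user, the busiest-hour share, the peak multiplier, and per-server capacity) is a hypothetical placeholder, not a value from this book; real inputs come from your own forecasts and observed usage patterns.

```python
import math

# All figures below are hypothetical placeholders; real inputs come from your
# business forecasts and observed usage patterns, not from fixed constants.
normal_daily_users = 50_000      # users expected on a normal day
requests_per_user = 40           # average service requests per user per day
peak_multiplier = 6              # spike factor for a sale, holiday, or marketing push
headroom = 1.3                   # safety margin so a spike does not become an outage
server_capacity_rps = 200        # requests per second one server is assumed to handle

# Assume roughly 20% of a day's traffic lands in the busiest hour.
busiest_hour_requests = normal_daily_users * requests_per_user * 0.20
baseline_rps = busiest_hour_requests / 3600
design_rps = baseline_rps * peak_multiplier * headroom

servers_needed = math.ceil(design_rps / server_capacity_rps)
print(f"Baseline load: {baseline_rps:.0f} requests/second")
print(f"Peak design point (spike x headroom): {design_rps:.0f} requests/second")
print(f"Servers to provision: {servers_needed}")
```

Because the fleet is sized for the peak design point rather than the baseline, most of that capacity sits idle on a normal day, which is exactly the excess-capacity problem described above.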
Disaster Recovery and Business Continuity

Business continuity planning (BCP) and a backup and restore strategy, along with disaster recovery (DR) planning, are important regardless of where they occur, whether in the cloud or on premises. Some of the basic challenges of doing these tasks on premises can be remedied by going to the cloud, but we will take a close look at what trade-offs we need to make in the cloud. Important aspects of cloud computing that we will discuss include the following:
• How stable is the cloud?
• What happens if one module of a program stops responding?
• What about data security?
• How do you back up computer data?
• How do you protect and recover data?
• What are your disaster recovery plans?

Continuous operations provide you with the ability to keep things running during an accidental disruption of service, as well as planned outages such as scheduled backups or planned maintenance.
DR and BCP are techniques often employed in cases where there cannot be any downtime. These environments need to be kept at a location different from the main data center and should be kept in such a state that they can take over operations and provide continuity for customers. Having the entire computer infrastructure running from a single data center often does not meet SLA requirements and is very risky for the business. Hence, enterprises must have a second data center for disaster recovery and meeting SLAs, doubling the cost. Imagine a bank that provides online money transfer services to its customers. If it has a production environment housed in a data center in New York, which is the main location for serving customer needs, the bank needs to have another data center in another state, called the BCP or DR site. This alternative center will be synchronized with the production environment at frequent intervals to ensure that it can take over offering services without much lag or disruption. If an earthquake or an undersea accident takes out the main internet backbone connecting New York to the rest of the web, the data center in the other state will be brought online and customers will seamlessly be served. The BCP is thus an important document that you hope you never need but cannot get by without (similar to your health insurance!). BCP is generally developed by the operations and engineering teams and reviewed regularly for accuracy and timeliness. The possibility of the BCP document becoming outdated with every release and new feature is very high, so we recommend a mandatory policy to review it on a regular basis. At the same time, there should be a policy to execute the entire plan regularly in DR drills. The drills will confirm whether following the BCP without any deviations allows you to offer services from the DR site without encountering any glitches. When considering the location for a data center, some organizations give preference to wide-ranging geographical locations, while others choose something closer to the corporate headquarters for convenient physical access.
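To make the idea of "synchronized at frequent intervals" concrete, here is a minimal Python sketch of the kind of automated check a DR drill might run: it compares the last successful production-to-DR sync and the most recently rehearsed failover time against tolerance targets. The thresholds, function names, and drill figures are hypothetical illustrations rather than anything prescribed in this book; real targets come from your SLAs and your BCP.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerance targets; real values come from your SLAs and BCP.
MAX_SYNC_LAG = timedelta(minutes=15)       # how stale the DR copy is allowed to be
MAX_FAILOVER_TIME = timedelta(minutes=30)  # how long bringing the DR site online may take

def dr_readiness(last_sync: datetime, rehearsed_failover: timedelta) -> list:
    """Return findings from a (simulated) DR readiness check."""
    findings = []
    lag = datetime.now(timezone.utc) - last_sync
    if lag > MAX_SYNC_LAG:
        findings.append(f"DR site is {lag} behind production; target is {MAX_SYNC_LAG}")
    if rehearsed_failover > MAX_FAILOVER_TIME:
        findings.append(f"Rehearsed failover took {rehearsed_failover}; target is {MAX_FAILOVER_TIME}")
    return findings or ["DR site is within its sync and failover targets"]

# Example drill record: the last production-to-DR sync finished 22 minutes ago,
# and the most recent rehearsed failover took 25 minutes end to end.
last_sync = datetime.now(timezone.utc) - timedelta(minutes=22)
for finding in dr_readiness(last_sync, timedelta(minutes=25)):
    print(finding)
```

Running a check like this on a schedule helps keep the BCP honest between full DR drills.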
Choosing a Site for the Data Center

You may want to consider a number of factors during the construction planning phase before choosing a site for a new data center facility (see Figure 2.1).
• Labor availability and costs
• Access to transportation
• Real estate availability and costs
• Geopolitical stability
• Risk of natural disasters
• Access to the power grid
• State or county compliance
• Government incentives
• Budget considerations
• Regulatory considerations
• People

Figure 2.1 Key considerations for choosing a data center site.
Labor Availability and Costs

One of the most important considerations is the staffing needs for the data center and whether it will be possible to hire qualified professionals in the area at a cost that is justified for the business. The costs and challenges of hiring qualified individuals can drive the decision of the location. In recent months, there has been a large movement of both tech industry professionals and companies to cities such as Austin, Texas, where real estate is much cheaper than in Silicon Valley, California. The expense of hiring a Linux administrator, for instance, is higher in the San Francisco Bay area than in a city such as Pittsburgh, Atlanta, or Cincinnati.
Access to Transportation Being near major transportation hubs is a convenience that cannot be overlooked. Selecting the right location is a critical element of data center planning, as the cost of easy access to transportation affects the TCO. For example, if your organization has an international presence, easy access to airports for travel is necessary. Since the data center is expected to be easily accessible during a natural disaster, being able to reach it by air, rail, road, or public transportation is necessary.
Real Estate Availability and Costs The cost of real estate can wreak havoc on the financial bottom line when it comes to location decisions. Is enough land available for you to build the
data center that your organization needs at an acceptable return on investment? In a city where prime real estate prices are doubling every two years, how much land do you want to purchase to lock in today’s price as opposed to being subject to the vagaries of the real estate market? These are decisions that senior management needs to make with the sustainability of the business in mind.
Geopolitical Stability Geopolitical stability is another major factor when it comes to constructing a data center. Politically unstable areas with civil unrest can damage the network infrastructure, and in the worst-case scenario, the government may cut off internet connectivity to these areas. Since the data center is a critical component of your operations, a grinding halt there would put your entire operation at risk and negatively affect the customer experience, as well as the company brand. Regions with territorial conflicts are always under the shadow of potential war. Taking this into consideration will help mitigate future risks.
Risk of Natural Disasters Natural disasters have to be taken into consideration, on top of power, heating and cooling, and security demands. We simply cannot ignore earthquakes, hurricanes, and floods, which may cause your company to fail via the loss of valuable customer and other business-critical records. In general, it is better to study the region that you are considering and how it could be affected by natural disasters. Knowing that a region is prone to frequent earthquakes could mean that you have to factor in additional reinforcement of the structures that house your data center, or else you rule out that particular location from your potential site list altogether and find another, more stable location.
Access to the Power Grid Data centers are a different type of real estate, as they have a higher demand for power than traditional structures due to the need to run and cool thousands of servers on a continual basis. Power is not uniformly available across the world, and the cost of energy has become prohibitive in many areas. There are some regions where you will have better access to uninterrupted power sources at a reasonable cost and others where you might have to pay a premium for these services. Having access to sufficient power to keep the servers running for 10 days or more becomes a significant business need should the primary energy source go down. The expense of utilities can become a deal-breaker when choosing a site.
State or County Compliance Some states and counties have stricter regulations for companies than others. Some may be extremely environmentally conscious and have stringent guidelines concerning the carbon footprint of your facility, while others may have looser standards. Compliance requirements can make a significant difference in your operating costs. For example, there may be personnel training costs related to the adherence to compliance requirements, which increases expenses.
Government Incentives Some states are business friendly and offer a number of incentives for firms to establish themselves there, including tax breaks for organizations that operate and hire employees from within the state. Tax credits for the money that you invest in a particular state based on the employment you create there provide a financial incentive for your organization to choose that state over others.
Budget Considerations The amount of capital available to set up and operate a data center can limit the choices that can be pursued. The decision to build, buy, or rent a facility depends on the cash flow situation of the business. With proper financial planning, you can make a decision to build immediately and claim depreciation over several years or make regular monthly payments without having to incur the upfront costs of construction. Owning the entire facility versus using a shared facility is another choice that is affected by the available budget.
Regulatory Considerations There are regulatory considerations for data center design that are affected by the domain where your business operates. Some examples of such regulations are the following:
• The Health Insurance Portability and Accountability Act (HIPAA) of 1996 is a federal law that developed national standards to protect sensitive health information from being disclosed without a patient’s knowledge or consent. The HIPAA Privacy Rule was issued to implement the law’s requirements. The entities covered by this rule include healthcare providers, plans, and clearinghouses.
• The Payment Card Industry Data Security Standards (PCI DSS) were developed to secure cardholder data. They apply to all entities that are a part of payment card processing. This includes merchants, processors, acquirers, issuers, and service providers. It also includes all other entities that store, process, and transmit cardholder data (CHD) and/or sensitive authentication data (SAD).
• The General Data Protection Regulation (GDPR) is Europe’s data privacy and security law. It was drafted and passed by the European Union (EU), but it applies to organizations anywhere that process the personal data of EU residents. The concept of data processing is used in the broadest possible sense. If your organization collects, stores, transmits, or analyzes information related to an EU resident’s name, e-mail address, Internet Protocol (IP) address, height, weight, political affiliation, or other characteristics, GDPR applies.
If the business handles customers’ personally identifiable information (PII), you need additional controls in place to help achieve PCI DSS compliance. If the business is in the healthcare field and works with patient data, HIPAA compliance is mandatory. If the business operates in Europe, you must comply with the GDPR.
People A data center needs teams with a variety of skills, including the following personnel:
• People providing physical security and software security
• System administrators who manage the IT infrastructure and the software that runs on that infrastructure to meet business needs
• IT teams monitoring IT services and infrastructure and responding to issues
• Compliance team members who ensure that the data center and its operations comply demonstrably with standards such as SOC2 and ISO 27001
Even though companies have automated many of their processes, people are still needed to operate data centers. Automation has certainly made certain processes more efficient and less time consuming, but hiring the right people is the starting point. Then you must begin training and onboarding them into the organization; providing new employees with a 30-60-90-day plan to help them learn what is expected of them and providing related training is useful. Pairing up new hires with mentors for the first few months helps them navigate your organization without feeling overwhelmed. It also accelerates their transition to becoming contributing, high-performing members of your team. Additional training for managers helps them understand the
mission and vision of your company and makes them effective stewards who can lead your employees in the direction of helping the organization meet or even exceed its goals.
New technology is being introduced, and existing technology is becoming outdated at a rapid pace. It is difficult for employees to keep up with the latest knowledge on their own, without leadership and management support. It is also in the best interest of your company to be a learning-focused organization and facilitate the updating of skills. Ongoing education helps keep your team’s knowledge up to date with current technologies, policies, and procedures. Identify, on an ongoing basis, the upcoming technology trends that can affect your company and its operations. One way to do this is to have a research and development (R&D) department for monitoring industry developments. Another is to create training programs with subject matter experts to keep your workforce at the leading edge of your industry by providing training and learning best practices that can be rapidly adopted in your day-to-day activities. Third-party learning solutions are readily available, such as O’Reilly Learning, LinkedIn Learning, and Udemy for Business. These are just some examples of educational platforms that provide access to a broad range of learning resources.
All the elements discussed here are trade-offs that you need to consider. Think about whether you would rather be in a state that offers you tax breaks or a state where you can hire people more easily to manage your operations. At the same time, consider whether you would prefer a stricter regulatory climate, so that your intellectual property is better protected, or less strict regulations, which may allow you to operate your company more profitably and grow it faster.
The Benefits and Limitations of a Traditional Data Center The traditional data center model has both benefits and limitations. It was one of the early approaches to digitization and has certainly made a great improvement in the efficiency of companies and their processes, yielding high productivity and greater customer satisfaction. Traditional data centers unlocked economies of scale by centralizing the computing power of a business and its management. There are certain fixed costs that you will incur regardless of the scale of your business. When you operate at a higher scale, these fixed costs are spread across more entities, redundancy of effort and oversight is reduced, and specialization and better
division of labor become possible. This allows you to achieve lower average fixed costs, commonly known as economies of scale. A bigger centralized data center for the organization versus five smaller server labs for each of five individual departments allows your organization to gain the benefits of economies of scale. Further advantages of this approach include the following:
• A traditional data center is easier to isolate and secure than the average corporate headquarters, which is designed to be a more accessible, welcoming structure.
• Your company’s data remain within the infrastructure that you own and operate, thus giving you more control over the security measures deemed necessary to put in place.
• Traditional data centers provide complete control over your data and equipment. The business has full authority over the individuals who are given or denied access. You also control exactly what hardware and software are used, and they can be customized to your specific needs.
We have to point out here that there are some limitations of establishing a traditional data center—namely, the effort, time, and resources required during the planning, operating, and maintaining phases. The disadvantages of this approach include the following:
• Setting up a data center and making changes to it are time-consuming activities. It is hard to accommodate the changing needs of the business once a certain level of investment and effort has gone into creating a certain type and size of data center.
• The costs of a traditional data center are higher, regardless of the size of your business, as security and regulatory obligations must be met whether you are a large or small company. Earlier, we described various regulations that a traditional data center may need to comply with, including PCI DSS, HIPAA, and GDPR. The cost of these regulatory requirements is an ongoing expense of training individuals, ensuring adherence by occasionally conducting internal audits, and keeping track of noncompliance and addressing related issues until the problem is fixed. It is more difficult for a smaller company to handle these expenses than it is for a larger, more established one.
• With the rapid pace of technology innovation, hardware and software in the data center face the risk of becoming obsolete almost as soon as they are installed.
• The skills that are acquired through years of experience and are needed to manage and operate traditional data centers are not very commonly available in the market, and classes in these topics are not taught frequently at universities. This makes it harder to find the right people with the needed expertise.
• In many organizations, the department defining and managing the data center, usually the IT team, is far removed from the front-line operations. This makes it harder for the data center to be resourced adequately in order to cope with ongoing changes initiated from other parts of the business.
• In the colocation model, the enterprise rents space, power, heating and cooling, and other services (e.g., network connection) from a third-party provider. That provider is responsible for the data center’s day-to-day operations, physical security, and additional services such as a network. The enterprise usually enters into a multiyear agreement to utilize the data center’s services for a fixed monthly cost. The enterprise is responsible for the computing resources and the applications that run on them. This model reduces the expenses (building and personnel) related to operating a data center, as well as the associated risks. Colocation offers more flexibility with managing capacity and disaster recovery. The enterprise can rent more space when increased capacity is required and employ multiple locations in order to address disaster recovery.
• The next evolution of colocation is managed hosting. In this model, the managed hosting provider provides the data center, computing resources (e.g., servers, storage, and networking), and additional services with a management interface to access, configure, and manage these resources remotely. Most of these providers handle the setup, administration, management, and support of computer resources, operating systems, and standard applications (e.g., web server, databases). Management services may vary, with some supporting basic security, including operating system updates and patching, 24/7/365 support, monitoring, and remediation of anything that could affect the applications’ performance. The enterprise generally negotiates multiyear agreements to utilize the hosting service and periodically makes adjustments to address its capacity. This model reduces the expenses (building, computer resources, and personnel) related to the data center, computing resources, and the associated risks.
Traditional data center models still exist today, but other options are available. Imagine a third-party service provider offering a fully managed service with dynamic capacity, providing multiple locations for disaster recovery, and charging by the hour, and for only the services that are consumed. This would drastically reduce the challenges related to capacity planning and disaster recovery. This model is called cloud computing, and we will explore this topic more in Chapter 3.
CHAPTER 3
Overview of the Cloud
Enterprises must rethink their core business offerings and adapt quickly to new market requirements. IT services are considered a core competency, and enterprises have spent millions of dollars building this competency over the decades. Traditional IT teams cannot continue to deliver growth and profits as fast as the enterprise demands, especially when it is also continually adapting to new technologies and regulations imposed by governments and industries. IT teams will have to do things differently by embracing newer technologies such as cloud computing, which promises faster delivery of enterprise services. In this chapter, we will see how IT staff can benefit from utilizing cloud computing to innovate faster and become more Agile.
Origins of Cloud Computing In the 1990s, telecommunications companies started using the cloud symbol to refer to the customer’s and the telecommunications provider’s demarcation point. Since then, the cloud symbol has represented a third-party network service that a user or enterprise employs. Today, the scope of the cloud symbol is extended to include third-party services beyond the network, including remote computer resources such as servers, storage, networking, applications, and interfaces to utilize and manage these resources. These services collectively are called “the cloud.” Figure 3.1 illustrates the differences between the network cloud in the 1990s and cloud computing today.
Figure 3.1 Differences between the network cloud of the 1990s and cloud computing in the 21st century.
Characteristics of Cloud Computing The National Institute of Standards and Technology (NIST) has defined five essential characteristics of a cloud computing environment:
• On-demand self-service: The ability to register, provision, and use resources directly via an administration console (web access) or programmatically without any human interaction with the cloud service provider (CSP). (A brief provisioning sketch follows this list.)
• Network access: All capabilities, including access to computer resources, data, and applications, are available via the network using heterogeneous devices (e.g., laptops, desktops, mobile devices) and software such as web browsers.
• Resource pooling: The computer resources are shared across multiple tenants in a multitenant model. Resources are dynamically assigned and reassigned without consumers having to know or worry about resource constraints or the resource’s physical location.
• Rapid elasticity: The ability to increase capacity rapidly by adding resources via self-service during peak demand and reducing capacity when the demand drops.
• Metered service: Usage is on a pay-per-use basis; that is, usage is continuously monitored on a per-second, per-minute, or per-hour basis or by the quantity used, such as application programming interface (API) calls or gigabytes (GB) stored, and invoiced monthly. Consumers get invoiced for what they used, similar to a monthly electricity bill.
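As an illustration of on-demand self-service and metered usage, the following minimal Python sketch uses the AWS SDK (boto3) to launch and then terminate a small virtual server programmatically. It is a hedged sketch, not a recommended production pattern: it assumes AWS credentials are already configured, and the AMI ID is a placeholder.

```python
# Illustrative sketch of on-demand self-service: provision a server via an API
# call, use it, and release it so that metered billing stops. Assumes AWS
# credentials are configured; the AMI ID below is a placeholder, not a real image.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request a single small instance (provisioned within minutes, billed while running).
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}; usage is metered until it is terminated.")

# Release the resource when it is no longer needed -- this is what makes
# the pay-per-use model work in practice.
ec2.terminate_instances(InstanceIds=[instance_id])
```

The same self-service behavior is available through the provider's web console; the programmatic form simply makes it repeatable and automatable.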
Cloud Computing—the Benefits and Customer Expectations According to NIST,1 Cloud computing “is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” Cloud computing relies on certain behaviors and policies to operate, which are described as cloud-native architecture. Cloud-native architecture offers elasticity of resources (a demand-based expansion of assets that allows applications to meet the load), redundancy by design, and cost based on usage. Its flexibility allows optimal usage over a period of time and payment only for the capacity utilized, and it frees users from long planning cycles and the underutilized capacity of fixed data centers. Third-party service providers (i.e., CSPs like Amazon Web Services [AWS], Azure, and Google Cloud Platform [GCP]) take on all the responsibilities and risks (e.g., physical and logical security) of operating cloud computing services. Enterprises consume cloud computing via the internet. CSPs are responsible for making all the computer resources available via the internet, including software and applications, to provision and consume these resources remotely via a web browser and programmatically via APIs and command-line interfaces (CLIs). The enterprise pays only for what it consumes and does not have to worry about long-term, multiyear commitments or capacity. Cloud computing is not a replacement for traditional models: it can replace data centers in some scenarios (e.g., flexible workloads) but not in others (e.g., fixed workloads). The full decision is complicated and beyond the scope of this chapter. Cloud computing has several benefits, making it an attractive business and economical alternative to traditional models. Here are some of the benefits of using cloud computing and related customer services:
• Cost: Cost is probably at the top of the list of benefits, and it is
also the most misunderstood of all the elements associated with cloud computing. Typically, cloud computing should be cheaper than traditional models. Still, it depends on the type of workload and the ability to utilize cloud capabilities such as on-demand usage, elasticity, and discounting mechanisms to make it cheaper. Cloud usage and cost management continue to be challenging, and enterprises still see significant waste when using these services.
1 https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-145.pdf
Let us use a simple example of when using cloud computing will not reduce your costs. Suppose that your organization has an enterprise application utilized 24/7/365, with a consistent number of human interactions and connections (API calls) per hour. You don’t see the demand for this application increasing or decreasing in the next few years. Typical steady-state workloads like this do not require additional capacity, whether central processing unit (CPU), memory, disks, or network, or performance improvements. One way that cloud computing can lower costs is if your application can benefit from autoscaling or on-demand usage. In this situation, since demand is consistent, you will need to run infrastructure resources 24/7/365 with no opportunity for optimization. Hence, utilizing cloud computing in this scenario may not reduce your costs.
• Reliability and availability: The reliability and availability expectations of the data center are decided based on the business-critical nature of the service that it provides. Reliability refers to the ability of a system or service to meet specified and expected performance criteria over a given period. A common way to measure reliability is by using a metric called mean time between failures (MTBF). Availability refers to the ability of a system or service to remain operational for its intended purpose. It is common to hear the phrase “five nines (99.999%) of availability” as the expression of the expectation for critical services such as a bank’s automated teller machines (ATMs); the short calculation following this list shows how little downtime each availability level allows per year. The greater the availability, the more stable a system or service is. When a system or service is not reliable or keeps crashing, customers cannot depend on it, as they are not sure if they will be able to use it the next time they need to. Such issues lead to poor customer satisfaction and loss of business, as the customers will change to other providers with more dependable systems and services. This can be devastating in industries where the cost of acquiring a customer is high and staying ahead of the competition is tough. Where there is a need for the most reliable service, provisions should be made to monitor it on an ongoing basis by support team staff who can respond to any alerts raised and have dependable communication systems that can keep affected users informed of the downtime situation. The maintenance windows of the data center are also developed around the business requirements. Typically, operating system updates, security patch installations, hardware component replacements, and similar activities happen during maintenance windows. Some of these maintenance activities may require the system to be restarted. In some cases, it is unthinkable for the system to go down for maintenance during business hours, so this needs to be planned for a time when no customers are negatively affected. Due to the criticality of the business, in cases where the data center service can never be down, even for maintenance, additional provisions will be needed to keep the business operating. One approach is to have multiple servers or sets of servers that can support customers’ needs and ensure that the servers are upgraded and restarted one at a time, while retaining at least one fully functioning server to serve customer needs at any given time. For services operating only during business hours that have very low user traffic during certain times of the day, those quiet periods are good times for maintenance. Depending on your operational-level agreements (OLAs) or SLAs, you will require data backups, disaster recovery, and business continuity measures to be in place for your computer resources. Enterprises will incur significant effort, as well as time and money, building and operating multiple data centers in the traditional model. CSPs operate in numerous locations, making it extremely easy to distribute applications, balance traffic across multiple locations, back up data, and recover from a disaster. In the cloud, the responsibility of designing and architecting your applications and infrastructure is yours. The CSP provides multiple locations, but you must decide whether you have to operate in multiple locations to meet OLA or SLA commitments. For example, on the East Coast of the United States, AWS has two regions, with northern Virginia having six availability zones (AZs) and Ohio having three AZs. In AWS terms, an AZ is a physically isolated, independent data center (i.e., it has separate power, cooling, physical security, and network systems). To run your application with redundancy, you must operate in multiple AZs within a region or across regions.
• Security: In the human body, the heart is one of the vital organs. Similarly, consider the data center as the heart of IT, which needs adequate safeguards in place. There are various aspects of security to consider, implement, and follow. Security of your applications and infrastructure is critical for any organization. Government- and country-specific security and compliance requirements make it hard for any organization to meet its security requirements at scale. CSPs meet this challenge by providing cloud computing at scale across multiple locations and have dedicated cloud environments for the government (e.g., GovCloud from AWS). CSPs have enough resources to meet these stringent security requirements and can move quickly and handle vulnerabilities rapidly. The following are challenges that must be considered regarding the security of data from the cloud perspective:
• Data ownership, operating model, and regulatory compliance
• Data integrity and data availability
• Integrating with existing security policies, frameworks, tooling, and audits
• Authentication and identity management across a hybrid organizational setup
• Privacy and data protection with changing regulatory standards
• Compliance standards (e.g., Statement on Standards for Attestation Engagements 16 [SSAE16], System and Organization Controls [SOC] 1/2/3, GDPR, Control Objectives for Information and Related Technologies [COBIT], International Organization for Standardization [ISO], and Information Technology Infrastructure Library [ITIL])
There is a need to establish specific practices within organizations for access management and security control, such as the following:
• Cloud identity and access management
• Cloud cybersecurity
• Cloud data protection
• Cloud security architecture
These focus areas will ensure that proactive planning, monitoring, reporting, and troubleshooting of security and data aspects of the cloud occur. Getting the scrutiny right across the platforms, both on the premises and in the cloud, through a well-informed and orchestrated cloud security architecture is very important. Generally, cloud security and cyberrisk monitoring can focus on the following specific areas:
• Data and access to data
• Application security
• Infrastructure security
• End-user security
• Network security
• Identity and access management
• Cloud security audit
Ensuring the security of the physical boundaries of your data center is the first step. Among other things, you want to define an access strategy that determines the identity of visitors, the area they should be able to access, the length of time to give such access, when to revoke it, the enforcement strategy of access, and how to detect and respond to violations of access. One common technique includes limiting entry points, monitoring the entry points, and regularly reviewing access logs for the entry points. Another technique is using physical barriers and crashproof obstacles to create a buffer zone around the data center facility. Without physical security measures in place, any malicious entity could break into your data center and gain unauthorized access to data or even steal it for nefarious purposes. Such breaking and entering has been committed by miscreants all over the globe, and data centers are not safe against such threats, which could lead to a disastrous outcome for your customers and, consequently, your business. Such a break-in can be as simple as an intruder slipping in behind authorized users as they swipe their badges at an entry point, a technique known as tailgating. Threats can also be physical; an intruder can ram a heavy-duty vehicle into an unreinforced wall and remove your servers from the premises. Enterprises that own and operate their data centers have to make huge capital investments in order to protect data centers against physical threats. CSPs operate data centers on a much larger scale than enterprises, often globally. They have standardized practices to help them meet regional and country-specific security and compliance standards with their extensive certifications and accreditations.
• Network security: Physical security and network security have many similar considerations. You need a strategy to determine visitors’ identity, provide and revoke access, and detect unauthorized access. Firewalls, intrusion detection software, and other similar systems are common in this area. One example of a recent network security breach was what became known as the “SolarWinds hack.” A widespread intrusion campaign occurred in which malicious actors, with the identifier UNC2452, gained access to several organizations across the world, including multiple government agencies, that were clients of a networking infrastructure company called SolarWinds. UNC2452 targeted both private and public enterprises, and the first step was to gain access to the company’s Orion IT monitoring and management software. Then a malware program called SUNBURST was released, which allowed unauthorized third parties to communicate with the affected servers and gave them the ability to transfer and execute files, profile systems, reboot machines, and disable system services. SolarWinds disclosed in a filing to the Securities and Exchange Commission (SEC) that a little more than 10% of its 300,000 customers were Orion customers, and that fewer than 18,000 of those may have had an installation of the Orion product with the malicious code. This type of network security violation has a nerve-wracking impact on businesses, customers, and society in general.
• Data security and backups: Data security involves protecting your data from unauthorized access and making appropriate backups so that you can restore it when needed. A stringent process of regular backups, with a well-defined policy outlining the backup frequency (e.g., daily, weekly, or monthly) and identification of storage locations, helps in restoring customer trust and minimizes business disasters.
• Security standards: There are a number of industry standards that data centers are expected to meet, such as System and Organization Controls 2 (SOC2). SOC2 compliance tells your customers and other stakeholders that you have appropriate controls regarding information security and are prepared to protect their data. SOC2 covers these categories: security, confidentiality, processing integrity, availability, and privacy. ISO 27001 is another standard that serves a similar purpose.
• Speed: CSPs make it easy to provision and consume vast amounts of resources via on-demand self-service. These resources are ready within a few minutes, providing the flexibility to experiment or launch new products and services quickly without worrying about capacity and other nuances that typically are necessary to consider in a traditional model. In a traditional model, once you have the budget approval for new computing resources, you have to deal with supplier lead times, which can take several weeks or months. Once the hardware arrives, you will need to install and configure it, adding more delays. Cloud computing allows you to access computer resources within minutes, saving you significant amounts of time and effort.
• Global availability and compliance: As your enterprise enters new regional markets, you will have to address country-specific requirements. For example, some countries require that you operate all the computing infrastructure within their borders and comply with country-specific requirements. In this scenario, enterprises are forced to operate data centers locally within these countries and make significant investments in meeting compliance requirements. Operating data centers in foreign countries can be a daunting
task that requires sourcing, procuring, and managing them by on-site staff. If the enterprise does not have the personnel to do this, then it will lose several months while procuring and building these capabilities. On the other hand, CSPs have cloud computer resources available in multiple countries and can meet country-specific security and compliance requirements more easily.
• Scalability: If there is one significant benefit of cloud computing, it is scalability. CSPs allow on-demand usage of resources with the ability to scale out and back depending on your business needs. Computer resources are available when you need them and ready to use within minutes.
• Capacity planning: One of the most challenging problems in operating data centers is capacity planning. Unused capacity is expensive and ties up capital, while too little capacity can significantly affect organizational performance and contribute to an unsatisfactory customer experience. CSPs build and operate very large-scale data centers that provide sufficient capacity to meet most enterprise requirements.
• Pandemic-related benefits: COVID-19 has forced enterprises to deal with their entire workforce having to work remotely. As staff shifted to working away from the office, enterprises have had to deliver all the IT services that were typically delivered on-prem to remote locations. The cloud has played a huge role in enabling IT to provide online services quickly by offering flexibility and on-demand usage.
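To make the availability figures mentioned above concrete, here is a minimal, illustrative Python sketch that converts an availability percentage into the maximum downtime it allows per year; the targets shown are common examples, not commitments from any particular provider.

```python
# Illustrative only: convert availability targets into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime (in minutes per year) for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% availability -> "
          f"{allowed_downtime_minutes(target):.1f} minutes of downtime per year")

# 99.9%   allows about 525.6 minutes (~8.8 hours) per year.
# 99.99%  allows about 52.6 minutes per year.
# 99.999% (five nines) allows only about 5.3 minutes per year.
```

Seen this way, each additional "nine" cuts the downtime budget by a factor of ten, which is why five nines usually requires redundancy across multiple AZs or regions rather than a single well-run server.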
Types of Cloud Computing As illustrated in Figure 3.1 earlier in this chapter, cloud computing has evolved since the 1990s and supports various models depending on what services a business needs. To better understand how this evolution came about, it is important to recognize the deployment and service models that have developed. NIST defines2 four deployment models: private, public, hybrid, and community cloud. The cloud service models, as defined by NIST,3 are as follows:
• Software as a Service (SaaS): Provides the ability to run applications on cloud-based software
• Platform as a Service (PaaS): Provides the ability to use tools, provided by a vendor, to build and run applications
• Infrastructure as a Service (IaaS): Provides the ability to use cloud-based infrastructure instead of on-prem infrastructure
2 https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-145.pdf
3 https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-145.pdf
Any organization planning to use cloud computing must consider the type of deployment and service model before starting.
Deployment Models The cloud deployment model, as shown in Figure 3.2, defines where computing resources are positioned and who has control over them. For each of the cloud deployment models described here, the differentiation is in the administrative control that it offers. The deployment model also has financial implications and determines how much control you have over the security of your data and resources. Assume that you are the administrator for your enterprise, which is running a private cloud in your own data center. You will end up maintaining all of the infrastructure (i.e., servers, network, storage) in order to have full control over your data and manage access better. Almost all public cloud providers offer security and access controls, so there may be very few reasons to keep a private cloud operational. To address the privacy concern, there is a concept called the virtual private cloud (VPC), which essentially allocates a virtual private network (VPN) environment inside the public cloud.
Figure 3.2 Cloud deployment models. (The figure compares traditional on-prem, hosting, colocation, IaaS, PaaS, and SaaS, showing which layers of the stack, from the data center, networking, storage, servers, and virtualization up through the operating system, middleware, runtime, data, and applications, you manage and which the provider manages, along with the provisioning and management interface.)
Private cloud. This is a cloud service dedicated to a single enterprise and is completely isolated (i.e., a share-nothing model); it can be located on- or off-site. Computing resources can be physical or virtualized and provisioned via self-service, are elastic, and can be automated and priced per use. The enterprise can choose to build and operate a private cloud, use a third party, or use a combination to provide these services. Computing resources are accessed via the enterprise network if the cloud is on-site or via the internet if a third party operates it. The key takeaway is that the computer resources are private to an enterprise and not shared with others. Enterprises lean toward the private cloud specifically for greater control over security and cost savings; many workloads are more cost-effective to run on a private cloud than the traditional model. The private cloud is also widely adopted, and in many cases, enterprises are considering migrating applications from public clouds back to private clouds for increased control and security reasons.
Virtual private cloud (VPC). The concepts of the VPN and VPC are almost synonymous. The term VPC is used by AWS and GCP, and VNet is the term used by Azure; either way, it is a mechanism by which a public cloud offers a portion of its assets to a specific enterprise in a private setup. The VPC allows you to pool a set of resources (i.e., compute, storage, data, and network) in a private space using Internet Protocol (IP) address subnetting along with the VPN functionality of the network. A VPC also offers a secure and scalable connection over the internet to a public cloud. It gives you the feeling of having your own cloud. VPCs use both IPv4 and IPv6 private address allocations. Figure 3.3 represents a scenario where an on-prem infrastructure running out of a data center connects seamlessly to the cloud infrastructure via a VPC connection. (A minimal VPC provisioning sketch appears at the end of this deployment models discussion.)
Figure 3.3 A virtual private cloud. (The figure shows on-prem infrastructure connecting over a secure connection to VPCs, such as AWS VPC1 and AWS VPC2, inside the AWS cloud.)
Public cloud. Unlike private clouds, public clouds are designed and built to allow multitenant usage of on-demand services using shared computer resources. These resources can be physical or virtualized and provisioned via self-service, are elastic, and can be automated and priced per use. Any consumer, including enterprise users, can sign up online using a credit card and start using resources within minutes, all via the internet using a web browser, programmatically, or via a CLI.
Hybrid cloud. This model is where an enterprise combines private and public clouds by sharing data and services between them. For example, an enterprise might choose to handle capacity by dynamically bursting application traffic to a public cloud from a private cloud. Bursting, or directing overflow traffic to an additional cloud, often addresses capacity-related issues, but it is challenging to execute because of the differences between
these clouds’ interfaces. Several enterprises offer solutions that enable the bursting of applications from one cloud to another, but these technologies are not mainstream yet.
Community clouds. The computer resources of community clouds are shared by several organizations supporting a specific community with shared objectives and requirements such as a joint mission, security requirements, policy, and compliance considerations. It may be managed by the organization, outsourced to a third party, or a combination and can be located on- or off-site. Organizations using community clouds benefit from shared costs while providing community-specific privacy, security, and regulatory compliance.
Apart from the four deployment models described here, enterprises that want to use multiple clouds for various reasons, such as to avoid lock-in or to utilize unique services such as artificial intelligence/machine learning (AI/ML) from a specific CSP, use a combination of clouds. This new deployment model is referred to as multiclouds, as described next.
Multiclouds. The term multicloud refers to a model in which an enterprise uses two or more clouds to run applications. Designing applications to run on multiple clouds is challenging due to the lack of interoperability between CSPs, making it difficult to design portable applications. While container technologies such as Docker (described in detail in Chapter 6) isolate an application from its underlying resources, they work only for simple workloads that do not have external dependencies. Typically, enterprises run different applications on different CSPs to avoid complex deployments.
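Returning to the VPC idea described earlier, the following minimal Python (boto3) sketch carves a private address space, and a subnet inside it, out of a public cloud. The CIDR blocks are arbitrary examples, and a real deployment would also configure routing, gateways, and the VPN or dedicated link back to the on-prem network, none of which is shown here.

```python
# Minimal illustration of a virtual private cloud: a private IP address space
# (and a subnet inside it) allocated within a public cloud. CIDR blocks are
# arbitrary examples; routing, gateways, and the VPN back to on-prem are omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the VPC with a private IPv4 address range.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Carve a subnet out of the VPC's address space for application servers.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")

print(f"VPC {vpc_id} with subnet {subnet['Subnet']['SubnetId']} created.")
```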
Service Models Service models let you choose how much control you have over your cloud computer resources, development platforms, and applications.
Infrastructure as a Service (IaaS) In IaaS, the CSP is responsible for the ownership and management of the data center, computer resources (e.g., servers, storage, network), and virtualization layer, including the self-service interface for on-demand provisioning and usage. Further, the CSP is also responsible for the security and compliance of all IaaS services. Enterprises have access to the same technologies and capabilities as a traditional data center, but without physically managing it. The CSP meters and invoices enterprises per second, per minute, per hour, or based on the quantity of resource consumed (e.g., GBs of S3 storage consumed).
IaaS Advantages IaaS offers several advantages over traditional data centers, including the following:
• Highly scalable, on-demand computer resources, provisioned within minutes
• Computer resources that can be provisioned and decommissioned programmatically (via scripts and APIs), speeding up deployments and reducing the number of staff required to manage resources
• Pay-per-use and pay-as-you-go models
• Reduced capital expenditures (CapEx), as monthly IaaS expenses are considered operational expenses (OpEx)
• Strong infrastructure security and compliance
• Availability across multiple locations, regions, and countries
• Access to state-of-the-art technology
• Flexibility to operate highly available and redundant infrastructure at scale
IaaS characteristics include the following:
• Multitenant and shared infrastructure: There are options to use dedicated resources if your workload requires it.
• Virtualized infrastructure: Resources are generally virtual, but most CSPs offer physical (bare-metal) servers for specific use cases.
• Highly scalable services: Enterprises can provision and use a single compute instance or thousands of instances.
• Complete control of the operating system: Enterprises have full access to and control of their operating systems, giving them the freedom to design and operate applications as they see fit.
• Shared security: CSPs take care of IaaS security, and enterprises have to manage application security.
• Network-driven: All services are accessed and managed online via the internet using a web browser or programmatically using scripts.
Use Cases for IaaS There are several use cases for IaaS and specific situations when IaaS is more advantageous than traditional data centers. Enterprises must evaluate if IaaS meets their business requirements by looking at the following points:
• Disaster recovery (DR) and backups: DR sites do not have to run 24/7, but they do require replication and standby sites. IaaS is a good fit for this scenario, as you can enable standby sites to take over within minutes of a disaster. Backups are another option using object storage (e.g., S3); enterprises can back up and archive terabytes of data at a much lower cost than traditional models (a brief backup sketch follows this list).
• Development and test environments: Typically, developers need access to several computer resources during their development and test cycles, including load-testing environments. IaaS serves this purpose due to its dynamic on-demand nature.
• Large or dynamic workloads: Big data analytics and the AI/ML type of workload process vast amounts of data and consume extensive computer resources, which is easier and more cost-efficient using IaaS.
• Proofs of concept: If an enterprise is looking to offer cloud-native applications, then using IaaS would assist in this effort because of faster provisioning times.
• Market access: IaaS enables enterprises to get immediate access to computer resources across the globe. This is very attractive for enterprises that do not want to invest in operating data centers in multiple geographical areas.
• Data center extension: When you need to increase your data center’s capacity without having to make capital investments, IaaS offers a flexible way of managing capacity at lower cost and greater speed.
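As a simple illustration of the backup use case above, the following hedged Python (boto3) sketch copies a local backup file to object storage. The bucket name and file paths are placeholders, and retention or lifecycle rules are not shown.

```python
# Illustrative backup-to-object-storage sketch. The bucket name and paths are
# placeholders; lifecycle rules for archiving and retention are not shown.
import boto3

s3 = boto3.client("s3")

# Upload a nightly database dump to an S3 bucket reserved for backups.
s3.upload_file(
    Filename="/backups/db-2022-01-31.dump",   # placeholder local path
    Bucket="example-enterprise-backups",      # placeholder bucket name
    Key="db/2022/01/31/db.dump",
)
print("Backup copied to object storage.")
```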
IaaS Challenges There are several benefits of using IaaS in an enterprise, but there are also several challenges. Enterprises should understand these challenges and plan to address them up-front before starting to use IaaS. Challenges may include the following:
• Costs: With traditional data centers, you have a fixed inventory and sunk costs. With IaaS, since you can dynamically provision instances, monthly fees can quickly add up and get out of control if not managed well (a small cost arithmetic sketch follows this list).
• Vendor lock-in: There is always the risk of getting too comfortable with a particular IaaS platform and creating a situation that prevents you from quickly switching CSPs.
• Process changes: IaaS will require changes to how IT and engineering teams operate compared with traditional models. Enterprises will have to invest in changing or adjusting existing workflows to accommodate IaaS.
• Security risks: IaaS security is robust, but the cloud is about shared responsibility between a CSP and an enterprise. Risks such as public exposure of customer data (e.g., personally identifiable information), including malicious access to confidential information, can be devastating. Enterprises must understand cloud architecture and security and be able to address potential security breaches quickly.
• Complex integration: If you run a hybrid cloud environment, integrating the clouds can be complex and expensive.
• Limited customization: Public IaaS is designed to support everyone, so customizations are not possible.
• Skills: There is a shortage of skilled and knowledgeable resources, such as DevOps and site reliability engineering (SRE). Hence, you will have to invest in extensive training of staff.
• Support: Dedicated support teams are provided only with enterprise support contracts (which incur extra costs). Failing such agreements, IaaS support will be accessed primarily online or via webchat or email.
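To illustrate how on-demand fees compare with an always-on footprint, here is a small, purely illustrative Python sketch. The hourly rate and instance counts are invented for the example and do not reflect any CSP's actual pricing.

```python
# Purely illustrative cost arithmetic: the hourly rate and instance counts are
# invented for the example and do not reflect any CSP's actual pricing.
HOURLY_RATE = 0.10          # assumed cost per instance-hour (USD)
HOURS_PER_MONTH = 730       # average hours in a month

# Always-on footprint: 10 instances running 24/7.
always_on = 10 * HOURS_PER_MONTH * HOURLY_RATE

# Autoscaled footprint: 10 instances for 12 business hours, 2 instances overnight.
autoscaled = (10 * 12 + 2 * 12) * 30 * HOURLY_RATE

print(f"Always-on:  ${always_on:,.2f} per month")
print(f"Autoscaled: ${autoscaled:,.2f} per month")
# Roughly $730 vs. $432 per month in this invented example; the gap is why
# unmanaged, always-on instances quickly inflate an IaaS bill.
```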
Examples of IaaS include the following:
• AWS
• Azure
• GCP
• Alibaba Cloud
• Oracle Cloud Infrastructure
• IBM
• Tencent Cloud
Platform as a Service (PaaS) In PaaS, the CSP provides a cloud-based software development platform for software developers to manage their application’s entire software life cycle (including development, testing, and deployment). Absent this system, software development teams have to set up, configure, and maintain the entire software development environment and framework, constantly patching and upgrading it. PaaS providers take on all this responsibility and provide an easy way for developers to manage their applications (a minimal application sketch follows the feature list below). Typically, PaaS provides the following features:
• Infrastructure: PaaS providers utilize IaaS to provide the underlying
computer resources, and hence all of the benefits and challenges of IaaS are part of PaaS.
• Software: Known as middleware, this type of software consists of the operating systems, language-specific compilers, interpreters, libraries, frameworks, and build tools.
• Management interface: The interface, whether a graphical user interface (GUI), a CLI, or an API, is used to configure, set up, build, deploy, and manage the PaaS environment.
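To make the PaaS workflow concrete, here is a minimal, hedged Python web application of the kind a PaaS (for example, a managed platform such as Elastic Beanstalk or Heroku) could run. The point of the sketch is the division of labor: the developer ships only the application code and a dependency list, while the platform supplies and patches the runtime, web server, and operating system underneath it.

```python
# Minimal web application a PaaS could run. The developer ships only this code
# (plus a dependency list); the platform provides and patches the runtime,
# web server, and operating system underneath it.
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a PaaS-hosted application!"

if __name__ == "__main__":
    # Local development only; on a PaaS the platform's web server runs the app.
    app.run(host="0.0.0.0", port=8080)
```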
There are several types of PaaS,4 but the three most popular are the following:
1. Prominent SaaS vendors offer PaaS as part of a SaaS product. SaaS providers, such as Salesforce, Workday, and Intuit, provide software developers an ecosystem around their SaaS application that they can utilize to quickly extend, customize, or build a new application on top of the SaaS platform.
2. PaaS tied to an IaaS provider is offered by the IaaS provider and integrated with its IaaS platform. It generally provides a layer of abstraction from the underlying IaaS, enabling software developers to develop and deploy applications rapidly. AWS Elastic Beanstalk is a good example of PaaS provided by AWS.
3. Open-cloud PaaS is independent of SaaS or IaaS providers and provides greater flexibility to software developers, but it can be expensive. For example, Cloud Foundry from VMware supports several development languages, including Java, Scala, Ruby, and .NET, and can be deployed on multiple public clouds, including VMware Cloud.
The advantages of PaaS are as follows:
• Scalability: Since PaaS uses IaaS for the underlying infrastructure, scaling your applications is built in. Scaling your application using PaaS is more efficient than scaling with IaaS.
• Simplicity: PaaS makes it easy and fast to develop, deploy, and manage applications without the developer understanding the nuances of maintaining a platform.
• Cost: Configuring and maintaining any runtime environment is expensive and takes a lot of skill and effort. PaaS provides a bundled environment for software developers, reducing the need to hire additional personnel.
• Improved time to market: PaaS increases the speed of development by reducing all the complexity and effort related to configuring and setting up the runtime environment. Hence, developers can focus on developing the application.
4 https://www.techtarget.com/searchcloudcomputing/definition/Platform-as-a-Service-PaaS
• Continuous updates: PaaS comes with frequent updates to all the
necessary security and component patches. This ensures that your application runs on the latest stack, drastically reducing application bugs and security risks and vulnerabilities that can be exploited to gain access to the system.
PaaS has many characteristics that define it as a cloud service, including the following:
• It builds on IaaS, so resources can easily be scaled up or down as your business changes.
• It provides services to manage the life cycle of software application development, testing, and deployment of apps.
• It offers a multitenant environment similar to IaaS.
• It integrates web services, databases, and other development tools and technologies, streamlining software development.
The challenges of PaaS include the following:
• Security: While PaaS security is mature, like IaaS, it is a shared, multitenant platform, and the CSP is responsible for the security of the platform. The enterprise is responsible for its application security.
• Integration: It is complex to integrate applications running in PaaS with traditional data centers. This will limit the scope of what applications you decide to run in a PaaS or incorporate within a hybrid environment.
• Vendor lock-in: PaaS makes it easy to develop and deploy applications by hiding all the complexity. This could lock you into a particular environment, making it challenging to migrate to another PaaS.
• Runtime customizations: The PaaS platform may not be optimized for a specific version of your framework, which could affect application performance.
The advantages of using PaaS include the following:
• Reduce costs: Like IaaS, you pay for what you consume on a pay-as-you-go model and don’t have to deal with hardware and software maintenance and personnel costs.
• Reuse code: Using a PaaS platform allows you to share and reuse code with other projects or initiatives.
• Speed: PaaS services further reduce what the enterprise has to manage and take away all the complexity of managing software development and runtime environments, which reduces the time and effort by DevOps teams to deploy and manage applications.
Examples of PaaS include the following:
• Google App Engine
• AWS Elastic Beanstalk
• Heroku
• Force.com
• OpenShift
• Apache Stratos
• Magento Commerce Cloud
Software as a Service (SaaS) The SaaS model involves the delivery and consumption of software applications over the internet. Generally, applications are browser-based, but some may require installing a client on your laptop, desktop, or mobile device. The CSP is responsible for everything from the data center to the application layer. SaaS CSPs may choose to use IaaS and PaaS to operate the platform or operate their own data centers and software stacks. The SaaS CSP will invoice the enterprise periodically on a monthly or annual subscription basis for usage. The enterprise subscribes to a SaaS platform and is responsible for application customization, user management, access control, and usage. SaaS enables customers to quickly get access to a software platform without having to worry about operating it. The characteristics of SaaS include the following:
• SaaS is offered on a monthly, annual, or multiyear subscription basis.
• Customers don’t have to worry about hardware and software upgrades, capacity management, and security and compliance. All of these responsibilities are handled by the SaaS provider.
The advantages of SaaS include the following:
• Lower up-front cost: SaaS is offered on a per-user or per-device basis on a monthly or annual subscription. There is also flexibility to increase or decrease usage based on the business’s needs. The enterprise does not incur any additional costs other than subscription fees.
• Quick setup and deployment: SaaS applications require minimal configuration. It generally involves setting up users, customizing the platform, and configuring access control. Once you have a commercial use agreement in place, you can be up and running within a few days.
• Easy upgrades: Everything, including security patches, hardware upgrades, and application updates, is the SaaS provider's responsibility.
• Accessibility: SaaS applications usually can be accessed via a web browser on laptops, desktops, and mobile devices from anywhere in the world.
• Scalability: SaaS providers manage capacity and performance based on the number of customers, and hence the enterprise does not have to worry about scalability.
SaaS sometimes has certain shortcomings, including the following:
• Lack of control: In-house software applications give businesses a greater degree of control than hosted solutions, where control is maintained by a third party. Typically, everyone has to use the latest version of a software application and cannot defer making upgrades.
• Security and data concerns: Access management and data privacy are significant considerations when using the cloud and hosted services.
• Limited range of applications: While SaaS is becoming more popular, there are still many applications that don't offer a hosted platform.
• Connectivity requirement: Since the SaaS model is based on web delivery, you will lose access to your software or data if your network connection fails.
• Performance: SaaS may run at somewhat slower speeds than on-prem client or server applications, so it's worth keeping performance in mind before switching.
SaaS may be the most beneficial option to use in several situations, including the following:
• Small and mid-sized businesses that cannot afford computing or application investments can still compete with larger enterprises using SaaS services.
• Short-term projects that require quick, easy, and affordable collaboration can benefit from it.
• Applications that aren't needed too often, such as tax software, are a good fit for SaaS, as are those that need both web and mobile access.
SaaS examples include the following:
• Salesforce • Workday
• Microsoft Office 365
• Cisco WebEx
• ServiceNow

Function as a Service (FaaS)

FaaS takes computing to the next level by allowing software developers to create and run a cloud function without worrying about the underlying computer infrastructure. FaaS is similar to PaaS and has been around since 2010. AWS launched its AWS Lambda service in 2014 and was the first major CSP to support FaaS, followed by Azure and GCP. CSPs manage the underlying infrastructure, resources, and capacity and provide APIs that developers use to configure and manage their functions. FaaS abstracts the infrastructure and runtime environment more than PaaS, making it easy and straightforward to run specific tasks instead of an entire application. The CSP charges you for the usage of the function rather than for provisioned computer resources; thus, FaaS is also called serverless. Functions are written in one of the supported runtime environments provided by the CSP. For example, AWS Lambda supports these runtime environments:
• Node.js
• Python
• Ruby
• Java
• Go
• .NET
• Custom Runtime: If you plan on running a function using another runtime environment, you will need to build a custom runtime environment.
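To make the FaaS model concrete, here is a minimal sketch of a function written in Python for the AWS Lambda runtime. The event field used here is a hypothetical input rather than something defined by AWS; the point is that everything about the server the code runs on is invisible to the developer.

    import json

    def lambda_handler(event, context):
        # Lambda invokes this handler on demand; no servers are provisioned
        # or managed by the developer.
        # "event" carries the trigger payload (for example, an API Gateway request).
        name = event.get("name", "world")  # hypothetical input field
        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"Hello, {name}"}),
        }

Once deployed, the function is billed per invocation and execution time, which is why the model is described as serverless.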
FaaS provides all the benefits of cloud computing. But despite this, adopting a serverless architecture at scale currently presents the following challenges:
• Complexity: FaaS introduces another layer of abstraction, which developers must account for in their architecture. They have to plan for maintaining functions, including managing security challenges.
• Maintenance: Not all the performance monitoring and debugging tools for applications support functions yet, making it hard to support and debug functions in a production environment.
• Performance: There is still a high level of uncertainty about the performance, scalability, and costs of running functions, as enterprises are just starting to explore FaaS technology.
• Control: Without control over the underlying infrastructure,
enterprises have to trust the CSP to support their application stacks.
Addressing these challenges will require understanding FaaS functionality and experimenting with it to identify cases that can benefit from FaaS, while not making your application stack complex to manage.
X as a Service (XaaS)

Earlier, we laid the foundation for new business models and how enterprises are adopting these models. The subscription business model has shifted from requiring significant upfront capital investment to having predictable monthly operational costs, freeing up cash for other investments. Cloud computing has been a powerful driver of these new business models. X as a Service (XaaS) is a generalized term referring to cloud service delivery models. The letter X stands for anything as a service. With XaaS, you pay a subscription fee to use the functionality of "X" rather than buying the solution outright. You pay a subscription fee for as long as you need to use the service, but you never own or operate the product. Here is a list of some XaaS offerings:
• Communications as a Service
• Content as a Service
• Desktop as a Service
• Network as a Service (NaaS)
• Logging as a Service
• Database as a Service (DBaaS)
• Monitoring as a Service
• Security as a Service
• Storage as a Service
• Testing as a Service (TaaS)

When Should Cloud Computing Be Used?

Up to now, we have seen the business and technical benefits of cloud computing. Now we will discuss some of the situations in which an enterprise might choose to use cloud computing.

On-demand workloads. If a company's workload or application has fluctuating demand (i.e., the demand is not constant), it can benefit from the cloud's elastic nature using autoscaling. Autoscaling enables
a workload to add capacity automatically when there is demand and reduce capacity when the demand subsides. In traditional models, the enterprise will need to make significant CapEx investments to purchase and deploy additional computer resources to support peak demand requirements. This additional capacity will sit idle when there is less demand, thus tying up your capital unnecessarily.

Shifting CapEx to OpEx. Shifting expenses from CapEx to OpEx frees up large amounts of capital for enterprises. If running data centers results in significant CapEx expenses, the cloud model reduces CapEx significantly by switching to a monthly subscription cycle. Before you decide to go down this path, you must evaluate your workload and see how much refactoring is needed to migrate to the cloud. Refactoring can be expensive and may have a material impact (in terms of time and money) on your business that may outweigh any CapEx savings. In addition, the cloud does not eliminate CapEx. (A detailed explanation of the differences between CapEx and OpEx is given in a later section.)

Cloud-aware workloads. Well-designed applications that are microservice-based, support virtualization, and feature containerized workloads can easily migrate to the cloud with minor refactoring. Depending on the CSP that your enterprise chooses, you will need to make a few customizations to the workload to make it run efficiently in the cloud.

Independent applications. If an application is self-contained (i.e., it packages dependencies such as runtime environment and data), it is much easier to migrate than an application with external dependencies.

Test and development environments. The computing resources used for testing and development purposes usually have less stringent requirements than the production systems. In addition, these environments can hugely benefit from the speed and on-demand nature of the cloud.

Load and performance testing. Apart from the standard development and testing environment, most enterprises conduct load and performance testing on their software. These tests are typically used to stress-test and validate how the software performs while being accessed by many users simultaneously. The data collected during these tests can be used to understand peak performance, steady-state performance, and breakpoints in the software. Load testing requires a large number of computer resources to simulate customers' traffic patterns. These tests require massive capital investment to build the necessary capacity to drive high traffic volume (e.g., processing millions of transactions per minute). They are conducted for short times and are
primarily automated and can be run as part of the standard software development life cycle. The cloud provides plenty of capacity on short notice, so it is ideal for load testing.

Market access. Entering new markets can be challenging due to country-specific compliance and regulatory requirements around data sovereignty and operational requirements governing how to use computer resources within the country. CSPs offer their services in multiple countries, meet all country-specific requirements, and are certified to operate within the countries in question. This makes it easier for enterprises to launch products and services using CSPs instead of having to operate computer infrastructure in each country.

DR and backups. CSPs work in multiple regions, making it easier for enterprises to operate numerous DR sites. In addition, services such as AWS S3 offer 99.999999999% durability and 99.99% data availability at a much lower cost, making management of data backups a lot easier and more affordable.
CapEx or OpEx?

When procuring infrastructure resources such as servers, storage, networking equipment, software, or services, enterprises often have two choices:
• Obtain them via CapEx, which is the traditional way that enterprises procured computer resources.
• Obtain them using OpEx, which is how cloud computer resources are available (i.e., on-demand and pay-per-use).
Before we get into the details of how CapEx and OpEx apply to cloud computing, let's quickly review their fundamentals.

CapEx. CapEx infrastructure expenses result in full up-front payment for long-term (i.e., beyond one year) benefits to an enterprise. For example, investments in building a data center and procuring generators, servers, storage, networking equipment, and software are all classified as CapEx. In most cases, the costs associated with the maintenance of this equipment are also classified as CapEx. Long-term investments generally benefit the enterprise over multiple years. Hence, the investment must be amortized or depreciated over its life span rather than being a single-year tax deduction. CapEx expenses are sunk costs that cannot be recovered before the investment period ends. For example, if you make a significant up-front investment to acquire computer resources because you anticipate demand increasing over the next few years,
and the market conditions change, you have tied up your capital and cannot pivot quickly to adapt. CapEx tracking and allocation also take extra effort to manage and can be challenging when calculating the actual cost and value of investments. In addition, this type of long-term investment will require ongoing maintenance and replacement, which often demands additional CapEx investing.

OpEx. OpEx expenses are part of the ongoing operating costs to run a business. This includes services and intangible items that are used monthly or quarterly and must be replenished. OpEx expenses are necessary for running your business and are always short-term, recurring expenses (i.e., incurred on an as-needed basis). OpEx costs are subtracted from revenue and fully tax-deductible, contributing to increased profit margins. OpEx is easy to manage, as it reflects the business's costs, benefits, and value.

Migrating to the cloud has several technical and financial benefits. However, it is hard to know if the CapEx or OpEx model would work best for you without doing a thorough TCO analysis. When we talk about cloud computing, we assume that it is an OpEx expense. However, this is not always the case. Cloud computing can be a mix of CapEx and OpEx, depending on how you consume the services, and thus it requires advance planning to optimize costs.

Let us take a simple example of a mid-sized organization with a monthly budget of $1 million to be spent on IaaS and PaaS. Assume that all of this expense is going to on-demand services (i.e., you consume based on your business demand and the CSP invoices you monthly for what you use). This expense meets all the criteria for OpEx. CSPs offer additional discounts if you are willing to make usage commitments in advance. For example, AWS has the concept of reserved instances (RIs), which allows an enterprise to get discounts on its hourly usage of computer resources (e.g., EC2) if it commits to a one-year or three-year agreement. Discounts can vary depending on the instance (type, region, tenancy, and operating system), offering class (standard versus convertible), the term (one year versus three years), and the payment option (all up-front, partial up-front, or no up-front). In essence, you are paying AWS up-front for the opportunity to run computer resources at a discounted hourly rate for the next three years. In a best-case scenario, standard RIs for a three-year commitment and all up-front payments can yield up to a 72% discount. While this may appear very attractive to an enterprise, it means that you are committing to using the services for three years. You are paying a major portion of the resource ("the asset") up-front, and then you will be invoiced every month for the remaining part of the asset. The unused portion of the
resource has value that can be put on the balance sheet as a CapEx investment and be amortized over three years. By purchasing RIs, you benefit from the hourly usage discounts, but your financial model has switched from OpEx only to a combination of OpEx and CapEx. Most enterprises utilize RIs extensively, and their cloud spend is a mix of OpEx and CapEx. What happens if your business requirements change in the second year and you realize that you need to make changes to the services, or even abandon them altogether? Just as in traditional data centers where you could sell the excess inventory in the used market (e.g., eBay), AWS provides you the option of selling your RIs in a venue called the RI Marketplace. AWS charges a 12% fee for this service, and you might be liable for taxes (e.g., sales tax) as well. Most enterprises don’t have an easy means to account for the revenue they get by selling unused or excess cloud computer resources. If the enterprise does decide to sell unused RIs, it would not recover their entire cost. Classifying your cloud expenses and managing usage and cost are complex topics, and we will cover some of those fundamentals in Chapter 4.
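To illustrate how an RI commitment shifts spend between OpEx and CapEx, here is a rough arithmetic sketch. It uses the $1 million monthly budget and the best-case 72% discount described above, and it simplifies by assuming the entire discounted cost is paid up-front; the numbers are illustrative assumptions, not AWS pricing.

    # Hypothetical figures: $1M/month of on-demand usage versus a three-year,
    # all-up-front reserved instance purchase at a 72% discount.
    monthly_on_demand = 1_000_000          # pure OpEx: billed as you consume
    term_months = 36
    ri_discount = 0.72

    on_demand_total = monthly_on_demand * term_months
    ri_upfront = on_demand_total * (1 - ri_discount)   # paid on day one, CapEx-like

    print(f"On-demand over three years: ${on_demand_total:,.0f}")
    print(f"All-up-front RI cost:       ${ri_upfront:,.0f}")
    print(f"Savings:                    ${on_demand_total - ri_upfront:,.0f}")
    # The up-front payment is capitalized and amortized over the term, so the
    # spend mix becomes CapEx plus whatever on-demand OpEx remains.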
Cloud Strategy

New business models are forcing enterprises to innovate and go to market faster in order to compete and lead in their industry. The cloud enables more rapid innovation by drastically increasing the speed at which enterprises can create and launch products and services globally. However, most enterprises struggle to realize their full potential due to misalignments between their cloud strategy and business objectives. A good cloud strategy needs to go beyond the technical benefits of the approach, focusing on business priorities and showing how cloud adoption will help the business realize its goals. You don't need a strategy when your users are just experimenting with using the cloud, but when you are ready to start adopting the cloud for real, it is essential.

A cloud strategy documents enterprise objectives, risks (e.g., CSP lock-in), and cloud adoption guidelines. It is a high-level document, not a detailed implementation or migration plan. In smaller enterprises where IT provides all the computing capabilities, cloud adoption can be straightforward because IT is responsible for both the strategy and the execution. In larger enterprises with multiple business units, shadow IT teams often adopt the cloud in order to bypass IT controls. In these situations, an enterprise-wide cloud strategy becomes critical; without it, cloud adoption is ad hoc, which undermines the benefits of the cloud and increases the risk significantly.

Creating a cloud strategy for such an enterprise is more complicated. You have to consider issues related to adopting it across several functions, including
IT and business units. You have to factor in the enterprise-wide requirements, which vary across business functions. For example, within IT, you have to evaluate the impact of using the cloud on your current computing environment, applications, processes, standards, policy, architecture, and governance frameworks. Your cloud strategy will not work if you don’t understand and address these enterprise challenges. Let’s use a few examples to see how enterprise-wide cloud strategy will be shaped:
• One of your business units is planning to introduce a new application. It wants to launch this service globally and requires computer resources across Europe and Asia. It is considering using a public cloud IaaS specifically for this application and will continue to operate some applications out of the data center.
• Your IT team has seen several technical issues over the last few quarters with its enterprise resource planning (ERP) system. The ERP vendor has a SaaS offering of the same product that addresses all these problems and lowers your on-prem operating costs of the ERP system. Therefore, using SaaS seems like an excellent option to improve customer experience and reduce costs.
• One of your enterprise applications sees huge data processing requirements at the end of every quarter, and you must increase your computing capacity for a few days per quarter but don't have the budget for it. Utilizing the cloud to manage capacity on demand seems like a winning strategy.
Your cloud strategy must accommodate all of these scenarios and the challenges around them. The strategy must handle several dimensions, and hence it must work with the entire enterprise. An enterprise-wide cloud strategy must define a set of high-level guiding principles for cloud adoption and use. It should address the following points:
• Scope: Which applications and data are in scope, and which ones are not? For example, an enterprise might decide that its ERP systems are out of scope for cloud migration due to a contractual obligation with the current ERP vendor.
• Role of the cloud in the enterprise: Why is the enterprise considering migrating to the cloud? The answer will vary and will depend on each enterprise's unique business needs. For example, some of the common drivers are wanting growth in a new market, cost savings, a competitive advantage, or a faster time to market. Without defining the role of the cloud in the enterprise, justifying adoption will become challenging.
• Technology and business challenges: This element links current technology and business challenges to outcomes and asks the question: what challenges are limiting your ability to meet business objectives? For example, longer procurement cycles for computer resources, constant capacity issues, modern infrastructure, reliability, security, and compliance can be problematic and need to be addressed.
• Cloud model: Will the enterprise use a specific set of cloud services, and if so, which ones and why? If the enterprise chooses to use IaaS services, there should be justification of and guidance on how this specific service should be adopted to benefit the enterprise. It is essential to document the services that the enterprise plans to consume to avoid a situation where multiple services are consumed, creating a potential vendor lock-in situation.
• Multicloud: Using multiple clouds has both benefits and challenges. Suppose that the enterprise has decided to use multiple clouds for technical or business reasons. In this case, the strategy must outline the reasons for using multiple clouds and provide guidance on when to use a specific CSP.
• Enterprise impact: When documenting the enterprise's impacts, they could be financial, security-specific, or cost- or compliance-related. Is the impact over the short or long term? For example, during the migration phase, your costs will increase due to supporting two environments (legacy and cloud) simultaneously.
• Future state: The desired future state for cloud adoption over the next two to five years should clearly outline the realized benefits of adopting the cloud and delivering business outcomes. For example, it could include cost savings, faster release cycles, entry into new markets, and a more robust security and compliance framework.
• Dependencies: What additional capabilities must the enterprise have to make cloud adoption successful, for example, staff with cloud-specific skills, governance framework, security practices, data classification, and cost management?
• Security: Align the enterprise security strategy to support the cloud's shared security model. Any compliance risks such as FedRAMP requirements must be addressed before migrating to the cloud.
• Support: How will the enterprise support cloud adoption? Will each business unit have its own DevOps or SRE teams, or will these tasks be centralized? How will cloud cost management and chargeback work? Will there be a single cloud management platform?
• Exit strategy: If you decide to stop using a specific CSP and go
back to a data center environment or transfer to another CSP, what will it take to migrate your applications and data? What terms and conditions must you negotiate with the CSPs to make this easy and manageable?
It is highly recommended that enterprises start defining their cloud strategy before migrating to the cloud. However, most enterprises start adopting the cloud without a strategy and then realize the importance of having one.
Cloud Sourcing and Procurement

In this section, we will focus on challenges with the sourcing and procurement of IaaS and PaaS. Procuring SaaS is slightly different, and we will not be covering that topic in this book. Sourcing and procurement for the traditional models are very involved, with multiple stakeholders. Procurement for traditional models involves numerous vendors, is highly customizable, has complex terms and conditions, and takes time. Larger enterprises tend to have dedicated teams that handle procurement and vendor management.

Compared with traditional models, cloud computing services are very standardized, with standard terms and conditions and multiple procurement options. In addition, since the CSP provides all the services, there are no additional vendors and complexity, thus speeding up procurement. CSPs enable direct online procurement via self-service, with the customer paying with a credit card and agreeing to standard terms and conditions. However, they also are willing to negotiate an enterprise agreement (EA) and provide additional discounts based on a multiyear (one to five years) or monetary (dollar-per-year) commitment. Direct online procurement works well for small and midsize businesses (SMBs) that do not have a centralized procurement team. Larger enterprises have dedicated teams, complex procurement workflows and tools, and compliance requirements for vendor management. Direct procurement is quick and easy, but it leads to the following challenges:
• Discounts: Most CSPs offer discounts when negotiating an EA with an annual or multiyear purchase commitment. While this goes against the pay-per-use and pay-as-you-go models of the cloud, CSPs use this strategy to increase their revenue and demonstrate growth. These discounts can be significant if you centralize your procurement and negotiate a single EA along with enterprise-wide discounts. With direct procurement, you get standard discounts and terms and conditions, and you cannot negotiate additional discounts.
• Resource pooling: CSPs allow you to optimize costs by sharing unused resources across the enterprise. However, they require all cloud accounts to be linked to an enterprise management account (formerly known as a master account), requiring an EA. For example, an enterprise can share AWS discounts at the corporate level to leverage the discounts centrally and pass them on to all the various functions using the cloud.
• Cloud sprawl: With multiple functions directly procuring cloud services, enterprises can end up with hundreds of cloud accounts, with multiple CSPs and no centralized visibility or governance. Without centralized visibility and controls on usage and costs, there is no guarantee that you will benefit from using the cloud.
We will discuss direct procurement and best practices when managing large-scale cloud migrations in the subsequent chapters of this book.
Optimizing CSP Costs

There are two paths for optimizing your cloud computing costs:
• Contract and discount negotiation
• Infrastructure optimization

Contract and Discount Negotiation

Sourcing, procurement, and vendor management teams need to know how cloud computing works. As more enterprises migrate to the cloud, there is a gap in the skills and knowledge of procurement teams when dealing with CSPs. Enterprise IT and software engineers are early adopters who often see the technical benefits of cloud computing but do not focus on the terms and conditions or any discounts they might receive. As a result, the demand for and usage of cloud computing increase for technical reasons without evaluating TCO and return on investment (ROI). If your enterprise is in this situation, you can either hire a third party to negotiate on your behalf or invest in acquiring the necessary skills and knowledge.

CSPs typically offer the following cost-saving options that enterprises can leverage without architecture or technical modifications to your application or workload. These are options provided by AWS; GCP and Azure offer some but not all of them:
• Private Pricing Agreement (PPA): AWS will offer you an additional
discount beyond the standard price as part of a PPA (formerly known as an Enterprise Discount Program) if the enterprise is willing to agree to a spending commitment. The higher and longer your
commitment, the more significant the discount. For example, if you commit to spending $20 million per year for three years, AWS will give you a certain discount. However, if you commit to spending $20 million per year for the next five years, you will get a higher discount. It is important to note that these discounts apply to all services, but they may not apply to all regions (e.g., China). You need to pay careful attention to these exceptions, and if your enterprise operates in a specific region that is excluded, PPA discounts will not offer any benefit.
• Individual Service Discount Agreement: AWS will consider negotiating discounts for extensive usage of a specific service. For example, if your enterprise consumes extensive egress bandwidth, you can negotiate a specific discount for egress bandwidth in addition to the overall discount. For AWS to consider a service-specific discount, you will have to be at or above the highest tier of consumption for the service. Enterprises that utilize extensive computer, storage, or network resources can try to negotiate individual service discounts.
• Reserved Instances Discount: This discount is available for everyone with or without a PPA. The enterprise commits to using a specific resource type for a one- or three-year term, and AWS gives you a discount on the hourly usage of the resource for the term. If cost savings is a high priority, RIs can be a significant source of savings, in some cases offering up to a 72% discount. RIs apply to EC2 and RDS instances only and are very challenging to manage, as there are multiple criteria to consider before purchasing them. As noted earlier in this chapter, AWS does give you an option to resell your RIs on the RI Marketplace. (A comparative arithmetic sketch follows this list.)
• Savings plan discounts: AWS introduced a number of savings plans at the end of 2019, bringing more straightforward cost savings into the mix. Savings plans provide more flexibility (applying across all regions and EC2 instance families) while offering the same cost savings (of 66% to 72%) as RIs. Unlike RIs, savings plans let you make an hourly commitment (i.e., you commit to a dollar-per-hour rate that you are willing to spend on resources that qualify for the plan in question). Savings plans apply to EC2, Fargate, and Lambda and cannot be sold on a marketplace.
• Spot Instance Discount: Enterprises can benefit from excess capacity in AWS data centers by bidding on their usage and getting up to a 90% discount. However, applications have to be designed to use spot instances and accommodate these resources' sudden interruptions with short notice. Enterprises interested in using
spot resources tell AWS how much they are willing to pay per hour for these resources. Spot instances will continue to run until the cost of the resource meets or exceeds the agreed-upon price. They are not recommended for mission-critical workloads. However, they are a good choice for transient workloads such as media processing, stateless web services, video rendering, continuous integration and deployment (CI/CD), testing, and development workloads.
• AWS Credits: These credits typically are offered to smaller companies that are in the early stages of experimenting and are not ready to negotiate a PPA. AWS applies these credits to a company's monthly invoice. AWS credits usually are given to start-ups and proof-of-concept projects that have the potential of generating future revenue for AWS. If your enterprise falls into one of these categories, you should negotiate for these credits. Under some circumstances, such as migrating an entire data center that could cost millions of dollars per year, AWS is willing to provide credits to offset some of the associated costs.
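To compare these discount mechanisms side by side, here is a small arithmetic sketch. The on-demand hourly rate is a made-up number, and the 72% and 90% figures are simply the maximum RI and spot discounts cited above, so treat the output as illustrative rather than as AWS pricing.

    # Hypothetical comparison of effective monthly cost per instance.
    on_demand_rate = 0.40          # $ per hour (assumed rate, not a real price)
    hours_per_month = 730

    options = {
        "On-demand": 0.00,                 # no discount
        "Three-year all-up-front RI": 0.72,
        "Spot (best case)": 0.90,
    }

    for name, discount in options.items():
        monthly = on_demand_rate * (1 - discount) * hours_per_month
        print(f"{name:28s} ~${monthly:7.2f} per instance-month")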
Infrastructure Optimization Activities

Just as in the case of traditional models, there are plenty of opportunities for optimizing the usage of cloud computing resources. Optimization utilizes CSP capabilities such as the ones listed here, as well as services, in order to prevent waste of resources and improve performance:
• Rightsizing: Unless you are very diligent in continuously monitoring performance, including CPU, memory, network, disk, input/output operations per second (IOPS), and load, and rightsizing your computer instances, there will always be oversized computer resources. For example, your workload may require 8 cores and 16 GB of memory with minimal disk space, but what you provisioned was a 16-core EC2 instance with 32 GB of memory and solid-state drives (SSDs). Your provisioned EC2 instance has twice the capacity that you need and probably costs twice the amount per hour. CSPs and several third-party vendors provide usage, cost monitoring, and optimization tools to review such data periodically and make adjustments. Such adjustments are straightforward and can be applied within minutes by restarting the instance with a lower configuration. This would result in immediate savings.
• Reduce waste: Often, engineers start using several resources during testing or proof-of-concept stages, but they fail to stop using these resources once the activity is complete. For example, an unwanted EC2 instance not only has an hourly cost, but it also consumes
other resources such as an IP address, storage, networking, performance monitoring, third-party software, and security services. Periodically monitoring for unused resources and stopping them can result in significant savings. It is best to have a well-defined process and, in most cases, to automate the stopping of these resources using a third-party service.
• Optimized scheduling: If your development, quality assurance, and load-testing environments do not have to run 24/7, shutting down these resources during the night can save you eight or more hours of usage per day. Several third-party services can be used to automate the shutdown and start-up of these resources.
• Architecture optimizations: All of the strategies listed here are quick and easy and deliver savings immediately, but optimizing your application's architecture can significantly reduce costs and improve performance. It takes extra effort, but it is worth doing. Architectural optimizations are modifications to your applications that utilize all the features in a specific cloud. There are several areas you can focus on when you want to optimize your architecture. We will cover a few critical areas in Chapter 4, and the rest you will need to explore on your own.
• Autoscaling: One of the significant benefits of cloud computing is elasticity, which allows you to increase capacity (scale out) and reduce capacity (scale back) on demand. CSPs provide services that enable you to scale out automatically, but your application needs to be modified to handle the task seamlessly. Let's take an example of an application that sees peak traffic from 9 A.M. to 12 P.M. (i.e., three hours a day). During this time, the traffic increases to 10 times the normal amount. In a traditional model, you would provision enough capacity (servers, disks, network) to handle three hours of peak traffic per day, but then these resources would sit idle for the remaining 21 hours. Using the cloud's autoscaling capabilities, you can automatically add more resources to handle peak traffic and scale back again once traffic returns to normal.
• Autoarchiving: CSPs offer block, object, and file-based storage. Object storage (e.g., AWS S3) is the most popular option, and it is often used to store large amounts of data. You pay for storage access (upload and download) and the total gigabytes of space consumed per month. AWS offers several classes of S3 storage designed to meet specific business requirements. For example, S3 Standard storage, the most expensive option, provides frequent, fast access to data. Suppose that your need is for backing up enterprise data that gets accessed only once or twice a year.
Using S3 Standard to store several petabytes of data can be very expensive. AWS provides an automated way to archive the data to a lower-cost and slower storage option such as S3 Glacier Deep Archive, without compromising the data's durability and availability.
• Network optimizations: Data transfer is one of the most expensive and complex functions. Most CSPs allow free, unlimited ingress data transfer from the internet, but they charge you for data egress. The egress rates vary by service type (S3 versus Load Balancer), region (US East versus US West), and interregion (from US East to US West), and are different if you have a direct connection to AWS from your data center. For example, all uploads to an S3 bucket are free from the internet, but you have to pay $0.02 per gigabyte of egress bandwidth when you retrieve this data in the US East region. There are exceptions to this depending on where you are accessing it from, but in general, egress bandwidth is costly and can add up quickly.
Given the complexity of egress bandwidth pricing, your application architecture needs to factor in the cost of replicating and serving data and be optimized for egress bandwidth. Let's assume you have a 1GB software image on an S3 bucket that you wish to make available for download to your customers via the internet. If you are averaging around 100K downloads per day, this would cost you about $2K per day and $60K per month just for the downloads. This does not factor in additional charges such as API requests made to download this data, which would drive the cost even higher.
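The download example above works out as follows. The rate, file size, and download volume are the same assumed figures used in the text, so the output is illustrative only.

    egress_rate_per_gb = 0.02      # $ per GB, the rate assumed in the example
    image_size_gb = 1
    downloads_per_day = 100_000

    daily_cost = egress_rate_per_gb * image_size_gb * downloads_per_day
    monthly_cost = daily_cost * 30

    print(f"Egress cost per day:   ${daily_cost:,.0f}")    # roughly $2,000
    print(f"Egress cost per month: ${monthly_cost:,.0f}")  # roughly $60,000
    # API request charges and other fees would push the real bill even higher.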
TCO for Cloud Computing

All for-profit enterprises care about costs and closely watch and control expenses to meet their financial objectives. The standard income statement captures costs as cost of goods sold (COGS) and operating costs. For a SaaS enterprise, COGS would include all the necessary expenses for operating the platform. Generally, COGS varies depending on the amount of usage; that is, as the number of users increases, the enterprise will need to invest more dollars in scaling the SaaS platform. COGS consists of the following:
• The IaaS and PaaS expenses used to operate the SaaS platform
• The cost of royalties and licenses for commercial software used on the SaaS platform
• The cost of the staff necessary to manage and support 24/7 operations of the SaaS platform
• The cost of staff for onboarding and technical support
• Other direct costs required to deliver the ongoing service

Operating expenses cover all the costs related to software development, marketing, and selling, including the following:
• Product development costs related to the SaaS platform
• Fees related to the licensing of software development tools, technologies, and support systems
• The cost of computing resources used for development and testing environments
Typically, SaaS businesses should have a gross margin (calculated by [(Revenue − COGS)/Revenue]) of between 70% and 80%; in other words, your COGS should be in the 20% to 30% range of revenue. To remain within this range, enterprises need to understand and manage their technology-related costs.
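As a quick worked example of the gross margin formula, using made-up revenue and COGS figures:

    revenue = 10_000_000      # annual SaaS revenue (assumed)
    cogs = 2_500_000          # hosting, licenses, and support staff (assumed)

    gross_margin = (revenue - cogs) / revenue
    print(f"Gross margin: {gross_margin:.0%}")   # 75%, inside the 70%-80% target
    print(f"COGS share:   {cogs / revenue:.0%}") # 25% of revenue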
TCO was popularized by the technology research firm Gartner in 1986. Its consistent approach is recognized as an industry standard for the financial analysis of technology costs. Without an accurate TCO of the current technology environment, enterprises cannot determine the savings and benefits of migrating to a cloud computing environment. To show that cloud computing costs less than your traditional model, you must compare the TCO of your traditional model with the TCO when using cloud computing. When calculating TCO, the time frame to consider may depend on the technology's entire life and its estimated annual depreciation. For most computing resources, three years is the typical period for calculating TCO, as the computer resources often become outdated after this time and need to be refreshed.

When calculating costs, most enterprises are not aware of all the hidden costs, which are often not visible to the cost accountant and hence not captured accurately. The TCO calculation must include the total cost of purchasing and operating a technology product or service over its useful life. While the up-front price captures the costs of buying a product or service, it does not capture the total cost of operating it. The total cost must include all the expenses you pay to procure, maintain, and use the product or service during its entire useful life. For example, let's say that your enterprise decides to use a third-party software product and signs a three-year agreement. The up-front cost is $500,000, plus $100,000 per year (20% of the up-front cost) for annual support. The software vendor also offers professional services to get you set up and running for an additional one-time cost of $25,000. Since the software runs on site, you will require computer resources to run the software and staff to install, configure, and maintain the software over the three years. In addition, you
will require investments for supporting change management and training of staff. Unless the cost accountant is aware of and tracking all these additional expenses associated with this software, they will factor in only the up-front costs and some recurring costs, making your TCO calculation inaccurate.
Calculating TCO

When calculating TCO, there are two categories of costs.

Direct costs: These are the costs that you directly pay money for and that go on your balance sheet. They include the following:
• Hardware and software: This includes the costs of physical servers, software licenses, spare parts and supplies, support contracts and warranties, and network bandwidth. You can extract these costs from your accounts payable systems or other sources such as invoices and purchase orders. Direct costs are generally well tracked and should be relatively easy to calculate.
• Operational costs: These include the cost of staff (full-time, part-time, and contractors) who are responsible for maintenance of the technology platforms; the data center, colocation, network connection, security-related, and change management expenses; and other costs such as training and skills upgrades.

Indirect costs: These costs are expenses that are not directly incurred; they are the consequence of using a technology platform. They include lost productivity due to downtime, inefficiencies due to lack of training, or the technology platform's complexity requiring constant support. Indirect costs can be difficult to estimate but are very important if they significantly affect the overall costs. If your quick assessment shows that the indirect costs are minimal compared to the direct costs, then you can skip them in the calculation.

Now that we understand what is involved with calculating TCO, a fundamental formula for calculating TCO is to add together all your direct and indirect costs:

TCO = Direct Costs + Indirect Costs
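Here is a minimal sketch of the TCO formula applied to assumed three-year figures for an on-prem workload; all the cost categories and amounts are illustrative assumptions.

    # Assumed three-year figures for an on-prem workload.
    direct_costs = {
        "hardware_and_software": 600_000,
        "operations_staff": 450_000,
        "data_center_and_network": 150_000,
    }
    indirect_costs = {
        "downtime_productivity_loss": 80_000,
        "training_gaps": 20_000,
    }

    tco = sum(direct_costs.values()) + sum(indirect_costs.values())
    print(f"Three-year TCO: ${tco:,.0f}")   # $1,300,000 with these assumptions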
Calculating Cloud Computing TCO

Now that you have a high-level understanding of calculating TCO, let's see what is involved with calculating your cloud TCO. Cloud migrations can be tricky; start by migrating a small workload or application first, identifying costs related to migration, and creating a larger plan.
Assuming you have already selected a CSP (AWS, GCP, or Azure), these steps apply to a single application or workload:
• Identify a simple workload that you are considering migrating to the cloud. The word simple means that it should not have any dependencies such as a network connection back to your data center or colocation facility to access data.
• Calculate the current direct and indirect costs and the TCO for this workload in your current environment.
• Use standard CSP-provided tools to calculate the TCO of IaaS and PaaS for this workload.
• Estimate your migration costs. When you migrate a workload, you have to estimate one-time costs such as moving data from the data center to the cloud, using external consultants, refactoring the workload to run in the cloud, integration, and testing.
• Estimate your post-migration costs. Once your workload starts running in the cloud, you will start paying monthly expenses and incur new fees such as training and documentation, change management, security, and compliance. You will also not need as many employees because there are no physical computer resources to manage.
After completing these steps, you should have a cloud TCO for your workload that you can compare with the TCO for running this workload in the traditional model. If the cloud TCO is lower than the traditional TCO, you reduce costs, but if not, you should look at optimizations to lower cloud TCO. Cloud computing has several benefits, and cost savings is one of them. However, migrating to the cloud solely for cost savings is not a good idea, because there will be cases where traditional models have a cost advantage over cloud computing. At this point, we will discuss ROI for the cloud and its relevance when justifying the cloud migration.
Cloud Return on Investment

While TCO measures the cost of owning and operating technology, ROI is used to measure and justify the effectiveness of technology investments. It is commonly used to justify and prioritize technology investments by comparing them to currently existing solutions. The basic ROI calculation is to divide the net return from an investment by the investment cost and to express this as a percentage. The ROI formula is:

ROI % = (Return − Investment Cost)/Investment Cost × 100
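Applying the ROI formula to assumed numbers for a cloud migration; both figures below are illustrative, not data from the text.

    # Assumed three-year return attributable to the migration, and the one-time
    # plus recurring cost of the migration itself.
    total_return = 1_800_000       # cost avoidance plus new revenue (assumed)
    investment_cost = 1_200_000    # migration, refactoring, training (assumed)

    roi_percent = (total_return - investment_cost) / investment_cost * 100
    print(f"ROI: {roi_percent:.0f}%")   # 50% with these assumptions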
An investment in the cloud is more likely to be approved if its ROI is higher than that of the existing technology investment. Financially, it makes sense to select investments with the highest ROI first, and then those with lower ROIs. Showing that the returns from your cloud migration outweigh those of your existing technology can justify the cloud costs. There are two categories of returns that you need to evaluate before you calculate ROI:
• Financial benefits
• Nonfinancial benefits

Financial returns are easy and straightforward to calculate in the following areas:
• Increasing revenue: Does cloud migration increase revenue by allowing you to enter new markets, add more customers, minimize downtime, or increase flexibility with upgrades?
• Lowering or avoiding costs: Does it reduce costs by optimizing the usage of computing resources? Does it avoid typical costs like extensive hardware maintenance and security costs?
• Reducing or avoiding capital expense: Does it eliminate or reduce the need to purchase additional computing resources? Does it eliminate the need to build data centers?
Nonfinancial returns are challenging to quantify. The real benefit of cloud computing is not the cost savings that it can bring, but the fact that your enterprise can react to business changes much faster and more effectively. It is hard to measure the value of agility and time-to-market, which are some of the benefits offered by cloud computing. We know that it is valuable when you can launch a product ahead of the competition and capture market share, but quantifying this value is difficult. Even if cloud computing TCO is higher than the current technology platform, agility and time-to-market value can increase revenue. These nontangible benefits will vary for your enterprise, and you will have to identify how much they contribute toward business growth. These are some examples of intangible benefits that you can evaluate:
• Improved customer satisfaction: This results in fewer customer support cases, which reduces your support costs.
• Entry into a new market: Using the cloud, you could quickly enter a new market, which results in additional revenue.
• Security and compliance: With the cloud, your workload's security and compliance weaknesses are being addressed more effectively, increasing customer trust in your products or services.
• Time-to-market: Your product teams can innovate quickly and release products and updates faster, resulting in higher utilization, which increases revenue.
The key is to be able to quantify the returns from nonfinancial benefits. Typically, ROI calculations for technology use a time frame of three years due to hardware life cycles. It is also essential to be consistent with the process that you used to calculate ROI. For example, if you are calculating ROI for three different migrations, keep all three calculations consistent. Your ROI will not always be accurate right down to dollars and cents; hence rounding off to the nearest thousands helps show the big picture. Other factors to consider when calculating ROI include the following:
• Existing investments: Migrating to the cloud will mean that you no longer require some of these investments in the data centers and computer resources. These costs can't be recovered and must be factored into the calculation.
• Skills upgrade: Migration and cloud management require new skills and knowledge. You will need to factor in the cost of training and upgrading skills or hiring personnel with the relevant skills.
• Inefficiencies: During the migration phase, you will continue to operate your workload in the current environment while deploying and testing it in the cloud. You'll incur additional costs based on the duration of the migration. When you decommission your older environment after migration, you may be left with unused data center space and computer infrastructure that must be discarded.
• Costs of risk and compliance: You don't have complete visibility into the potential security and compliance risks (e.g., meeting Germany's C5 standard) that result from migrating to the cloud. If it turns out that you need to make additional investments in these areas, you must plan for these extra costs.
Common Cloud Misconceptions

The need for speed and flexibility continues to drive enterprises to adopt the cloud, but adoption is plagued by misconceptions that have slowed the process since the beginning, impeding innovation. This section will cover some of these issues and provide context on why they exist and what to do about them.

Multicloud (discussed earlier in this chapter) will prevent vendor lock-in. Vendor lock-in refers to the switching costs associated with replacing a vendor's technology with another solution. In theory, there should be no lock-in with a specific vendor (i.e., the enterprise can replace any piece of its technology with some planning). However, in reality, the time and effort associated with switching from one technology to another can make it extremely expensive, making it seem like a lock-in.
Switching costs can be technology- or business-related; they have been around ever since IT organizations existed and continue to exist today. For example, suppose that an enterprise decides to switch from one DB technology to another for technical reasons, such as improved performance, or business reasons, such as a co-selling partnership agreement. In both cases, you will incur switching costs, such as the time and effort spent migrating your data and modifications to your application stack to accommodate the new DB. Using multiple clouds offers a number of business advantages, such as providing an enterprise with the knowledge and skills needed to operate in more than one cloud and offering some leverage when negotiating cloud contracts. Designing an application to work on multiple clouds is complex and expensive. The application must be developed using microservices architecture, containers, and cloud-agnostic tools in order to make it portable. Even with a portable design, you will still incur some switching costs, such as exporting data from one cloud to another.

The cloud always saves you money. While this is often true, it depends on the workload and the application stack's ability to take advantage of the cloud's capability. If you have a fixed-capacity workload with minimal variance in usage, then moving such a workload to the cloud will not save you money. If the application has usage variance but cannot utilize cloud capabilities such as elasticity or on-demand usage, your costs will not decrease. You will have to refactor the application to benefit from the cloud, which will result in additional expenses. A full-scale migration from a data center to the cloud can save you a significant amount on the data center infrastructure and related costs such as data center personnel. However, adopting the cloud requires additional investments in training and hiring personnel with cloud-specific skills and expertise. Hence, you have to calculate and compare your current TCO with cloud TCO before deciding to use the cloud.

The cloud is less secure than on-prem data centers. There is no straightforward way to debunk this misconception. All CSPs use a shared-responsibility model when it comes to cloud security. This means that the CSP and the enterprise users share the burden of securing their applications. Depending on what services (IaaS, PaaS, or SaaS) your enterprise consumes, the level of responsibility varies accordingly. For example, suppose that the enterprise chooses only to use IaaS services with AWS. In that case, the security of the computing infrastructure (i.e., servers, storage, and network) is the responsibility of AWS, and the enterprise still has to secure its applications. In most cases, securing your application includes patching and maintaining
the operating system and all open-source and third-party commercial software. AWS will not take responsibility for any potential vulnerabilities in your application stack. In general, security in the cloud tends to be better than on-prem security. CSPs have invested extensively in meeting various government, country-specific, and industry compliance requirements. Cloud security has evolved over the last few years, and it is very mature today. Most cloud-specific security breaches result from user misconfigurations, such as leaving the contents of an S3 bucket accessible to everyone.

Transitioning to the cloud is complex. This depends entirely on the complexity of your application stacks and business requirements such as security and compliance. For a small enterprise with a few applications, a lift-and-shift approach will suffice initially to transition to the cloud, and you can focus on optimizations after the initial migration. For a larger organization with several hundred applications with interdependencies and stringent security requirements, transitioning to the cloud can get complicated. In these situations, you will require sufficient up-front planning to identify which applications can and should transition to the cloud and how much effort is involved with a lift-and-shift approach as opposed to refactoring the applications.

Once I move to the cloud, I am done. This statement is not entirely true. Adopting IaaS, PaaS, or SaaS reduces the overhead associated with managing infrastructure and software stacks, but it does not disappear altogether. Moving to the cloud provides enterprises with several benefits, but you still have to manage it. For example, with IaaS, you will reduce the overhead related to data center management, but you still have to manage the cloud infrastructure's life cycle. The cloud provides a programmable infrastructure that allows you to automate the management of the infrastructure life cycle. However, you will need to invest in automation by training existing personnel or hiring new personnel with cloud-specific skills. The applications running on an IaaS platform will need to be continually updated and patched, just as you would if they were running in a data center. The SaaS model significantly lowers the overhead associated with infrastructure and application management, but it still requires personnel to manage it. For example, you will need staff to manage all the enterprise users of SaaS, evaluate their credentials, and troubleshoot issues. You'll need technical support teams to address fundamental questions on getting set up and accessing the SaaS platform. Depending on the SaaS platform's sophistication, you will need personnel to customize it to meet your enterprise's requirements.
The cloud is not reliable. The reliability of an application depends on its ability to handle faults and recover with minimal (or perhaps no) human intervention. If the application is designed to handle hardware and software errors, reliability should not be an issue. The cloud can be very reliable, but you have to design for reliability, just as in the data center environment. For example, if your entire application stack runs in a single AWS AZ, the application's reliability is tied to the reliability of the AZ. Similarly, most CSP services offer an SLA for each service, and it is essential to understand the SLA while designing an application. For example, AWS EC2 provides 90% availability per hour (i.e., it can be down for no more than six minutes per hour). If you have a critical service running on a single EC2 instance, it could potentially be down for six minutes per hour. Having a second instance running in another AZ would increase your reliability drastically by reducing the AZ and the EC2 SLA risks.

When it comes to reliability, the cloud is more flexible than traditional data centers. For example, all CSPs offer multiple regions and AZs that software architects can use to distribute their applications. Several services allow you to distribute underlying EC2 instances across multiple AZs and regions. For example, the AWS RDS service operates in a single AZ by default, with an SLA of 99.95% per month. However, it gives you the choice of selecting multiple AZs when provisioning, which further increases your uptime by protecting against a single AZ failure. AWS also takes care of automatically failing over to the redundant RDS instance without the need for the application to manage failures.
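A back-of-the-envelope sketch of why the second instance helps, using the 90%-per-hour figure cited above and assuming the two instances fail independently (an assumption, since correlated failures do occur):

    single_instance_availability = 0.90      # figure used in the text
    downtime_minutes = (1 - single_instance_availability) * 60
    print(f"One instance:  ~{downtime_minutes:.0f} minutes of downtime per hour")

    # With two independent instances in separate AZs, the service is down only
    # when both are down at the same time.
    two_instance_availability = 1 - (1 - single_instance_availability) ** 2
    print(f"Two instances: {two_instance_availability:.0%} available "
          f"(~{(1 - two_instance_availability) * 60:.1f} minutes of downtime per hour)")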
State of the Cloud In this section, we will briefly review the current state of cloud computing. Since 2000, the cloud (IaaS, PaaS, SaaS) has seen exponential growth, and in 2020, the total market capitalization crossed the $1 trillion mark. In 2019, IaaS revenue surpassed $100 billion,5 with the top three CSPs (AWS, Azure, and GCP) owning approximately 56% of the total worldwide public IaaS market share. The annual growth has been staggering, with AWS growing at 33.2%, Azure at 62.3%, and GCP at 67.7%. Based on Gartner’s assessment, AWS continues to dominate IaaS and PaaS and is the market leader, followed by Azure and GCP. Some of the current cloud trends are as follows:
• Enterprises are using multiple clouds, with 93% of enterprises having a multicloud strategy and 87% having a hybrid cloud strategy.6 When cloud computing started getting popular, most enterprises focused on using a single cloud. However, over time, they have realized the benefits of using multiple clouds or using data centers and a public cloud. In addition, enterprises have learned how to address the challenges of using multiple clouds, and several third-party solutions make it easier to manage multiple clouds.
• Public cloud adoption and usage continue to grow, with 20% of enterprises spending more than $12 million per year on public clouds.
• Cloud usage and spend management continue to be challenging across enterprises, with an average of 23% expected to go over budget. In addition, cloud spending is expected to increase by 47% in 2021. Cloud spend waste (e.g., unused resources) is at 30%, and 73% of enterprises plan to optimize the cost of their cloud spend. We will discuss the challenges with managing usage and cost in Chapter 4, but the critical point is that usage and cost management are complex.
• AWS, Azure, and GCP continue to be the top three public clouds, and Alibaba, IBM, and Oracle continue to catch up. Azure is narrowing the gap with AWS, and GCP has experienced the fastest growth of the three CSPs.
• PaaS usage continues to grow, with database-as-a-service and container-as-a-service growing across AWS, Azure, and Google.
• Shortages of skilled IaaS and PaaS personnel will continue to slow the migration to the cloud. Without a cloud-knowledgeable workforce, enterprises will need to invest in extensive training to build knowledge and expertise in the cloud. In addition, AWS, Azure, and GCP clouds are very different, making it challenging to transfer knowledge across clouds quickly.
• CSPs are making investments in distributed clouds that will address the needs of low-latency applications. Distributed clouds will be placed closer to where the demand is and offer a subset of the cloud services.

5 https://www.bvp.com/atlas/state-of-the-cloud-2020
6 https://www.flexera.com/blog/cloud/cloud-computing-trends-2021-state-of-the-cloud-report/
Cloud computing deployment methods are constantly evolving, and the current trends will change depending on several factors, including global security and compliance and regulatory requirements across the United States; Latin America; Europe, Middle East, and Africa (EMEA); and Asia Pacific
(APAC) regions. For example, country- or region-specific data residency and data localization requirements force the placement of data locally within a specific geographic boundary. Suppose that CSPs don't have a presence in these locations. In such a case, enterprises that operate entirely in the cloud will be forced to rethink their strategies for using public clouds. This will shift the deployment models from public to private, hybrid, or back to traditional data centers.

It is worth mentioning that the hybrid cloud model is increasingly becoming popular in larger enterprises. Enterprises migrating to the cloud for the first time have to deal with several constraints, including sunk costs in their data center environments (e.g., multiyear agreements with data center or network service providers). The hybrid cloud model enables the enterprise to work around some of these constraints by providing a mechanism to use the public cloud as an extension of its data center. For example, the enterprise can use the public cloud to address short-term capacity needs or deploy specific workloads. This provides the flexibility to continue utilizing its investments in data centers and gradually transition workloads to the cloud as the contracts expire.
Part II Technology
CHAPTER 4
Cloud Operations
Up to this point, we have reviewed the fundamentals of cloud computing, its benefits, and various deployment and service models, and touched on sourcing and procurement, cloud strategy, cloud misconceptions, and future trends. Now we will explore cloud management, but first we need to understand how the cloud works. SaaS management is less complex than IaaS and PaaS and mainly involves user management, security, and customizations to meet enterprise requirements. Hence, we will focus on IaaS and PaaS management and will use AWS as our source for the examples.
How Does the Cloud Work? This section will review how the cloud works from the perspective of the consumer (i.e., enterprise users). In the discussion, we will focus on the high-level steps and dive deeper into the technology only when required. The first step when getting access to a public cloud service is to sign up for and access the cloud service. Anyone with an e-mail address, credit card, network connection, and a laptop or mobile device can sign up. AWS, GCP, and Azure offer online self-serve sign-up pages. Upon sign-up, users get access to a virtual data center with computing resources such as virtual machines (VMs), storage, and network, and services such as DBs, monitoring, and firewalls. Apart from the terminology, AWS, Azure, and GCP use a similar model for signing up and utilizing the cloud. This model makes it easy for anyone to get set up and begin working within a few minutes. It works well and is easy to manage if you use the cloud for personal reasons or if you are a small enterprise with a few accounts shared across multiple users. However, if you are a midsized or larger enterprise with a few hundred accounts
with a single CSP or multiple CSPs, cloud management gets complicated quickly. Let’s look at some of the challenges of using multiple accounts across an enterprise:
• Monthly payments: With multiple accounts using credit cards for monthly invoicing, the enterprise has to track and process usage, cost, and numerous invoices. Each invoice will need to get charged back to a specific function or department based on the organization structure.
• Security: Each account has to be secured and protected to meet the enterprise's standards or compliance requirements. Without centralized visibility and management, implementing security policies across all accounts can become challenging.
• Cost management: Cost-saving mechanisms such as reserved instance (RI) purchasing and savings plans offer more flexibility when multiple accounts are grouped and managed centrally.
• Governance: Without centralized management, there is no ability to conduct audits to enforce policies and standards.
CSPs designed public clouds for mass consumption with minimal or no integration outside the platform. Public clouds had fundamental features and functionality that met most user needs. As public clouds became popular over time, though, enterprises realized that they needed to integrate these public clouds back into their networks for authentication, security, and compliance. CSPs acknowledged the challenges associated with managing a significant number of accounts within an enterprise and responded by adding enterprise-specific features such as consolidated billing and centralized management capabilities. For example, AWS allows multiple accounts to be consolidated or linked to a single billing account, which enables the enterprise to receive and pay a single invoice per month for multiple accounts. AWS also introduced the concept of “AWS Organization” to manage multiple accounts centrally along with consolidated billing. Consolidated billing allows centralized invoice payment and management. AWS Organization enables the enterprise to organize accounts using organizational units (OUs), which can align with its organization structure (e.g., multiple teams, business functions), and apply security policies to each of these OUs. Enabling and using AWS Organization also allows an enterprise to create and customize accounts programmatically and centrally, which is a huge benefit to the enterprise for centrally governing hundreds or thousands of accounts. For example, the enterprise can programmatically create an
account for a team, assign it to an OU, apply OU-specific security policies, and customize the account with basic security such as enabling logging, restricting services, and setting up remote monitoring. With AWS Organization, enterprises have more control over cloud accounts and can address the challenges related to security, governance, payments, and cost management. GCP and Azure offer similar approaches for managing, securing, and governing multiple accounts.
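As a hedged illustration of this kind of programmatic, centralized account creation, the sketch below uses the AWS Organizations API through boto3. The e-mail address, account name, root ID, and OU ID are placeholders, and a real implementation would sit behind the enterprise's approval workflow and add proper error handling.

```python
import time
import boto3

org = boto3.client("organizations")

# Request a new member account inside the organization.
response = org.create_account(
    Email="team-alpha-root@example.com",        # placeholder enterprise-owned address
    AccountName="team-alpha-dev",               # placeholder account name
    RoleName="OrganizationAccountAccessRole",   # cross-account admin role created with the account
    IamUserAccessToBilling="DENY",
)
request_id = response["CreateAccountStatus"]["Id"]

# Account creation is asynchronous; poll until it completes.
while True:
    status = org.describe_create_account_status(CreateAccountRequestId=request_id)
    state = status["CreateAccountStatus"]["State"]
    if state != "IN_PROGRESS":
        break
    time.sleep(10)

if state == "SUCCEEDED":
    account_id = status["CreateAccountStatus"]["AccountId"]
    # Move the new account from the organization root into the team's OU.
    org.move_account(
        AccountId=account_id,
        SourceParentId="r-examplerootid",        # placeholder root ID
        DestinationParentId="ou-exam-ple12345",  # placeholder OU ID
    )
```

From this point, OU-level security policies apply to the new account automatically, and further customization (logging, monitoring, restricted services) can be scripted in the same workflow.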
ITOps, DevOps, SRE, and the Cloud The term information technology operations (ITOps) refers to the set of services provided by the IT organization to its internal and external customers to meet business requirements. The services cover all the applications, infrastructure, security, governance, and processes that deliver value to the business. ITOps functions cover several areas, including infrastructure and application management, software development, backups, disaster recovery, laptop and desktop services, help desk, data center services, and governance.

As software development teams started using Agile methodology for developing applications and products, they saw the need for continuous delivery of code with higher quality. This was possible only if the entire software development and deployment process was fully automated for speed and quality. In the ITOps model, software development and ITOps teams worked as two independent teams, operating in their silos, each with its own goals and objectives. The silos and lack of collaboration between these teams often resulted in poor planning and affected the product quality, which in turn affected the customers and the enterprise as a whole. ITOps struggled to keep up with the software development team's demands. To improve collaboration and build better products, the DevOps model was introduced. DevOps methodologies were created to remove silos between development and ITOps teams by integrating everyone and everything responsible for software development and deployment. This included people, infrastructure resources, processes, ITOps, business teams, compute resources, continuous integration and deployment (CI/CD) and security processes, and tools. The goal was to create a highly automated workflow with the responsibility for the rapid delivery of high-quality software.

Google introduced the concept of SRE a few years before the introduction of DevOps. There is a lot of overlap between SRE and DevOps methodologies, but the key difference is that while DevOps describes what needs to be done, SRE prescribes how it can be done. SRE and DevOps are two different approaches with a common objective: delivering high-quality software fast by breaking down silos between software development teams and ITOps. Table 4.1 illustrates the five DevOps pillars and the corresponding SRE practices.
Table 4.1 Five DevOps pillars and SRE practices.

DevOps: Reduce organization silos
SRE: Share ownership with developers by using the same tools and techniques across the stack.

DevOps: Accept failure as normal
SRE: Have a formula for balancing accidents and failures against new releases.

DevOps: Implement gradual change
SRE: Encourage moving quickly by reducing costs of failure.

DevOps: Leverage tooling and encourage automation
SRE: Encourage automation and minimizing toil to focus on efforts that bring long-term value to the system.

DevOps: Measure everything
SRE: Believe that operations is a software problem and define prescriptive ways for measuring availability, uptime, outages, toil, etc.
The following section will introduce cloud operations and explain how cloud operations is similar to and different from DevOps and SRE.
Organization Structure and the Cloud Three popular organization structures and how cloud operations fit into them are shown in Figure 4.1. IT-led structure. This structure usually applies to enterprises that utilize IT services such as ERP, payroll, point-of-sale (PoS) systems, and order management, with no e-commerce services. In this structure, all technology services are centrally managed by IT, and the other functions, such as HR, finance, sales, and marketing, depend on IT to deliver services. IT uses the cloud to provide these services and hence owns all responsibility for the cloud within the enterprise. In this organizational structure, IT owns the cloud strategy and cloud governance, and cloud management is embedded within the organization.
Figure 4.1 Popular organization structures: (a) IT-led structure; (b) business-unit-led with IT support; (c) business-unit-led without IT support.
Business-unit-led-with-IT-support model. In this structure, an enterprise has separate business units (BUs) to support various product lines. BUs are responsible for profit and loss (P&L) and have a lot more autonomy in making technology choices. BUs utilize IT as the central resource for technology and hosting services. They may define the requirements, but IT designs, builds, operates, and supports applications and e-commerce sites. Product teams may also choose to design and develop applications, but they depend on IT for operations. There is a close partnership and shared responsibilities between the BUs and IT in this organizational structure. For example, a BU may choose a specific CSP because of a particular feature or capability, such as advanced AI capabilities, and expect IT to support its application on the CSP’s platform. IT no longer has end-to-end responsibility for the cloud in this structure. IT needs to consult with the BU on cloud strategy and have product teams involved with cloud governance and cloud management. However, since IT is still responsible for operating the cloud, cloud management can be embedded within IT. BU-led-without-IT-support model. This structure applies mainly to technology companies that provide online services such as SaaS applications to their customers. In this model, BUs have the responsibility for P&L and design, build, operate, and support their SaaS application. They often choose the CSP based on their application requirements. IT is mainly responsible for back-end services such as ERP, payroll, and laptop support, and internal applications to support HR, marketing, and sales. IT may also choose to
deliver services via the cloud and use a CSP for that function. This organizational structure makes cloud management complex and requires a central team responsible for cloud strategy and governance across the enterprise. Without an enterprise-wide strategy and governance, IT and each BU may end up negotiating separate contracts with CSPs and hence take charge of the security and management of the cloud.
Why Manage the Cloud? Cloud governance is an agreed-upon and well-defined set of policies and standards aiming to reduce enterprise risk by providing a governance framework that can be enforced via periodic audits, measurements, and reporting procedures. Governance also defines the process to address failures and identifies the decision-makers responsible for mitigation and communication. Cloud governance is not something that you define once and then forget. It needs to be continually monitored and adjusted as the enterprise progresses in its cloud adoption process. Cloud governance should address the following fundamental areas:
• Create enterprise-wide, enforceable policies that address cloud adoption and use: With multiple functions of the enterprise using different clouds, there must be standard and enforceable policies to provide guidance and protect against enterprise risk.
• Define criteria for the selection of CSPs: Enterprises select a particular CSP for a specific reason. It could be for technical or business reasons, but defining enterprise-wide standards mitigates the risk of having cloud sprawl (proliferation of cloud instances, services, or providers), which can become a huge risk.
• Define best practices for multicloud and workload distribution.
• Provide guidance and best practices on addressing security and regulatory compliance: Even with the cloud following the shared-security model, enterprises still have to be deeply involved with protecting their applications and data in the cloud. Security and compliance best practices per cloud should be centralized.
• Define financial management governance: The enterprise may have migrated to the cloud for cost reasons. However, without standards to measure cost savings, the enterprise will not be able to tell if it is meeting this objective.
• Drive cloud transformation across the enterprise: Apart from providing best practices, cloud adoption requires training and skills upgrades to keep up with continuous changes in CSP services.
• Address security and regulatory compliance across all functions.
• Define best practices for managing cloud usage and costs.
• Define criteria for the selection of CSPs for use across the enterprise.
• Provide best practices for a wide area of topics ranging from user management to standardized architecture.
• Provide guardrails and guidance during the entire life cycle of cloud adoption and use.
To ensure successful adoption and use of the cloud, enterprises must have the right skills and governance structure. With multiple enterprise functions utilizing the cloud and controlling their budgets, the focus must shift from cloud control to cloud governance. The easiest way to achieve this is by setting up a centralized team responsible for defining an enterprise-wide cloud governance framework. Gartner and others refer to such a group as a cloud center of excellence (CCoE), a centralized governance team supporting all functions across the enterprise. The CCoE must be unbiased and flexible in creating a governance framework to support all the requirements of cloud consumers. Public clouds such as AWS, GCP, and Azure are designed for easy consumption, but adoption across enterprises can become complicated due to organizational politics. Let's review where the CCoE will exist in the three organization structures defined previously.

In the IT-led structure, IT delivers enterprise services using the cloud and charges the functions for the services they provide. In this model, IT owns the cloud budget and is responsible for paying the monthly invoices for the services. There are no external dependencies, so IT owns cloud governance and management. In this situation, the CCoE should reside within the IT organization.

In the BU-led-with-IT-support model, the BUs play a role in selecting the CSP and are involved with their application stacks' daily operations and cloud governance. In this model, IT must have a partnership approach and share the responsibility of the CCoE with the business units. This function can reside within IT or under the BUs.

In the BU-led-without-IT-support model, the BUs and IT make independent decisions on CSPs, including the cloud's selection and day-to-day operations. In addition, each function is responsible for budgeting, cost management, invoice payments, and governance. In this model, the CCoE must reside outside of IT and the BUs but have representation from both IT and the BUs. The objective is to keep the CCoE independent of organizational politics and prevent any single function from prioritizing its requirements.

A CCoE provides the governance framework to implement the cloud strategy successfully. The CCoE should be the primary vehicle for leading and governing cloud adoption across all service models—IaaS, PaaS, and SaaS.
It is critical to note that the CCoE is a team that provides leadership, best practices, research, support, and training for the cloud. CCoE is not responsible for day-to-day operations of the cloud (i.e., CCoE does not manage the cloud). Enterprises will have to dedicate a cloud operations team to enforce the policies and standards defined in the governance framework.
Introduction to Cloud Management Cloud management refers to how an enterprise controls and orchestrates all the computing resources and services to deliver a product or service. Typically, cloud management includes CSP account management, user and access control, data, security, applications, resource management, usage and cost monitoring, and service management. When an enterprise has access to a cloud account, it is similar to having access to multiple interconnected data centers across the globe, with a large set of resources that can be utilized within a few minutes. Typically, enterprise users will have access to all resources across multiple locations or geographies that the CSP supports. For example, AWS offers an object storage service called S3, which allows users to create buckets that are similar to folders in an operating system. Users can create these buckets in an AWS region and manage them using the S3 service. The S3 service enables users to configure and manage buckets utilizing a user interface via a web browser or programmatically via application programming interfaces (APIs). CSPs typically support APIs for programming languages such as Java and Python. Resources such as an S3 bucket have a life cycle, and managing the resource across its life cycle is part of cloud management. Cloud management comprises two areas:
• Resource life cycle management: This is the technical management of the cloud. It is usually owned by software engineers, DevOps, and SRE teams.
The typical life cycle of resources consists of the following actions: provisioning, configuration, operation and maintenance, monitoring and reporting, and decommissioning (a minimal sketch of this life cycle using the S3 API follows this list).
• Cloud operations: The focus is on centrally managed, business-focused parts of the cloud, including meeting enterprise governance requirements such as cost, security, and compliance.
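The following boto3 sketch, a minimal illustration rather than a complete management solution, walks an S3 bucket through that life cycle: provisioning, configuration, monitoring, and decommissioning. The bucket name, region, and tag values are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
bucket = "example-enterprise-bucket-12345"   # placeholder; bucket names are globally unique

# Provisioning: create the bucket in a specific region.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# Configuration: tag the bucket so usage can be charged back to a cost center.
s3.put_bucket_tagging(
    Bucket=bucket,
    Tagging={"TagSet": [{"Key": "cost-center", "Value": "bu-1234"}]},
)

# Operation and monitoring: confirm the bucket exists and report its contents.
s3.head_bucket(Bucket=bucket)
objects = s3.list_objects_v2(Bucket=bucket)
print(objects.get("KeyCount", 0), "objects in", bucket)

# Decommissioning: delete the bucket once it is empty and no longer needed.
s3.delete_bucket(Bucket=bucket)
```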
Introducing Cloud Operations While DevOps and SREs focus on automating the software development and deployment workflows to deliver high-quality products, cloud operations is more focused on the business aspects related to adopting and using a cloud platform within an enterprise. The responsibilities of cloud operations vary depending on organization structure and the level of complexity. For example, if your enterprise uses a single CSP with a few accounts, cloud operations’ responsibilities are minimal. However, if you are a midsized or large enterprise with hundreds of users across multiple business units and use more than one CSP, the responsibilities of cloud operations increase significantly. In Chapter 3, we discussed the importance of having a cloud strategy. It should outline who is responsible for governing the cloud. Some organizations task this responsibility to a central team, such as a CCoE. This team is responsible for defining the cloud governance framework to steer an organization during the cloud’s discovery, adoption, and operation phases. A cloud governance framework is similar to an IT governance framework and is not a new set of concepts or practices. The IT governance framework defines the approaches and methods through which an enterprise can implement, manage, and monitor IT governance within an enterprise. It provides guidelines and measures to utilize IT resources and processes within an organization effectively. A cloud governance framework defines the approaches to institute proper controls and optimize cloud services within an enterprise. The cloud governance framework must have these pillars:
• Financial management
• Operations management
• Security and compliance management
• Data management

Responsibilities of Cloud Operations
Cloud operations is not a replacement for DevOps or SRE. DevOps and SRE methodologies focus on building high-quality products and have responsibility for everything related to designing, building, and operating a product or service, including governance and risk management. The way that software gets designed, built, and deployed differs between a cloud environment and traditional methods, but the DevOps and SRE methodologies remain the same. However, the cloud introduces additional requirements in an enterprise that did not exist in the traditional methods.
With multiple functions using multiple accounts and CSPs, cloud management will become extremely complicated without centralized governance. Let’s use an example to illustrate this complexity. An enterprise has functions that use AWS as the cloud platform. They begin with one account per function and soon realize that it is a bad idea to run multiple environments in a single account. Hence they start creating more accounts to isolate environments. This leads to hundreds of accounts across the enterprise, with each function being responsible for the governance of these accounts. Apart from managing the application life cycle, DevOps and SRE teams also have to focus on cloud governance. They can govern, but their focus is on producing high-quality applications, not governance. Should one of the functions be responsible for the governance of all the accounts? It could do this, but the organization’s charter is to focus only on its functional priorities, and it doesn’t have the required skills or knowledge in account life cycle management, including cost management and other areas. In a midsized or large enterprise, managing a cloud environment requires the setup of the following shared services that are common across all CSPs:
• Cloud account life cycle management
• Cloud account identity management
• Cloud audit and compliance management
• Cloud usage and cost management and resource optimization
• Cloud architecture best practices
• Cloud operational best practices
• Cloud connectivity management
• CSP and tools selection
• Workload distribution
• Skills and training
• Migration assessment and guidance
• CSP management (e.g., escalations, quarterly business reviews, credit management)
• Shared services (Git repositories, standard templates)
Cloud Account Life Cycle Management As discussed earlier, the creation and management of cloud accounts (or “projects” and “subscriptions,” as they are called in GCP and Azure) are more manageable if you plan to use just a few accounts across the enterprise. For example, if you have three separate accounts for Development, Quality Assurance, and Production, it is straightforward enough to manage three accounts. However, if you plan to use hundreds or thousands of
accounts distributed across the enterprise among several functions, you need a strategy for handling the life cycles of all these accounts. The life cycle of an account includes the creation, management, and decommissioning of the account. A security breach in any of the accounts can result in monetary loss due to misuse of resources or public exposure of your applications and data. Account security is an integral part of the account management life cycle. To manage the life cycle of an account, you want to address some of the challenges related to these areas:
• Account creation: AWS accounts can be created by anyone using an e-mail address and credit card. By default, when creating an account directly using AWS's sign-up Uniform Resource Locator (URL), enterprise single sign-on (SSO) is not provided as a choice, and hence users can use any e-mail address to sign up. Enterprises must provide a mechanism for users to request and receive an AWS account configured to only allow enterprise SSO to AWS. Without such a mechanism, accounts created by enterprise users are considered private to the user. If the user leaves the enterprise, you now have a situation in which a nonemployee has complete control over an enterprise account. In this scenario, it isn't straightforward to have AWS grant access and ownership of the account to the enterprise. AWS considers accounts with private e-mail addresses as private. In another scenario, if there is no centralized invoicing setup and the administrator of an account fails to pay the monthly invoice, AWS shuts down the account after a period of time, which may result in your application being shut down. Recovery of an account that has been shut down due to nonpayment can take a few days, during which your application will remain down. To address these challenges, AWS has introduced AWS Organization (Figure 4.2), an account management service that lets you consolidate multiple AWS accounts into an organization that you create and manage centrally. It allows you to manage accounts using a hierarchy similar to your organization structure by creating multiple groups of accounts based on your organization's functions (e.g., BU, IT, Sales).
AWS Organization allows centralized governance of your AWS accounts and offers a number of capabilities, including the following:
• Centralized management of all of your AWS accounts: You can create accounts that are automatically a part of your organization, and you can invite other accounts to join your organization. You also can attach policies that affect some or all of your accounts.
Figure 4.2 AWS Organization structure: a management account at the top, with organizational units (IT OU, Security OU, and Business Unit OUs such as Business Unit-1 OU and Business Unit-2 OU), each containing member accounts.
• Consolidated billing for all member accounts: The enterprise receives a single invoice from AWS across all accounts.
• Hierarchical grouping of your accounts to meet your budgetary, security, or compliance needs: You can logically group accounts under organizational units to simplify management.
• Policies to control AWS services: Enterprises can control what services are accessible across all accounts. For example, you can apply a policy that restricts the use of the S3 service across all accounts. If such a policy gets applied, all users across all accounts will not have use of the S3 service (a sketch of such a policy appears after this list).
• Policies that configure automatic backups for the resources in your organization's accounts: Such policies enable and manage backups centrally.
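As a sketch of how such a service restriction might be expressed, the example below creates a service control policy (SCP) that denies all S3 actions and attaches it to an organizational unit using boto3. The OU ID is a placeholder, and the policy shown is deliberately minimal.

```python
import json
import boto3

org = boto3.client("organizations")

# A service control policy (SCP) that denies all S3 actions.
scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Deny", "Action": "s3:*", "Resource": "*"}
    ],
}

policy = org.create_policy(
    Name="deny-s3",
    Description="Blocks use of the S3 service",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)

# Attach the policy to an organizational unit; every account under that OU
# (and its child OUs) loses access to S3, regardless of IAM permissions.
org.attach_policy(
    PolicyId=policy["Policy"]["PolicySummary"]["Id"],
    TargetId="ou-exam-ple12345",   # placeholder OU ID
)
```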
A complete list of capabilities can be found in the AWS Organization User Guide.1
1 https://docs.aws.amazon.com/organizations/latest/userguide/organizations-userguide.pdf
AWS Organization enables the enterprise to create AWS accounts programmatically via APIs. This allows centralized creation of accounts within an enterprise, thus preventing individual users from signing up using private e-mail addresses and credit cards. Programmatically creating accounts provides more flexibility to the enterprise and enables adding controls and governance processes, such as creating an approval workflow whenever an account gets provisioned. In addition, the newly created account can be customized to meet the enterprise’s governance requirements, such as creating a security and audit role that enables DevSecOps and audit teams to monitor the account for breaches or policy violations. The recommended best practice is to have the cloud operations team be responsible for account management and governance across the enterprise. The cloud operations team’s responsibility is to ensure a low-risk, centralized mechanism to create accounts across multiple CSPs. Leaving account creation and management processes to each function to manage will result in security exposures and the necessary overhead to the organization for managing users and monthly invoice payments.
• Account management: Once you have hundreds of accounts, these accounts will require daily, weekly, and monthly management. When creating an account, the enterprise can apply best practices via automation to secure and govern the account. For example, you can configure central logging of all actions performed across your organization using AWS CloudTrail, which would enable information security teams to track potential security breaches (a minimal sketch of such an organization-wide trail follows this list). In addition, you can configure the account to monitor usage and cost via third-party cost management platforms such as CloudHealth and Turbonomic.
• Account decommissioning: When an account is no longer needed, it should be closed to prevent unauthorized access and usage. The enterprise can also define approval workflows to prevent the accidental decommissioning of accounts and take all the necessary steps before closing an account. For example, the cost management platform and other systems monitoring the account must be adjusted before it is closed. In addition, you have to monitor unpaid charges in these accounts, which must be reconciled if the account is closed partway through the monthly billing cycle.
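As one example of the automated baseline described in the account management item above, the following boto3 sketch enables an organization-wide CloudTrail trail from the management account. The trail and bucket names are placeholders, and the log bucket is assumed to already exist with the bucket policy that CloudTrail requires.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Create a single trail that records API activity in every member account.
cloudtrail.create_trail(
    Name="org-audit-trail",                   # placeholder trail name
    S3BucketName="example-org-audit-logs",    # placeholder, pre-created log bucket
    IsOrganizationTrail=True,                 # capture events from all member accounts
    IsMultiRegionTrail=True,                  # capture events from every region
    EnableLogFileValidation=True,
)

# Trails are created in a stopped state; start delivering log files.
cloudtrail.start_logging(Name="org-audit-trail")
```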
Cloud Account Identity Management Most enterprises use an SSO service to authenticate and authorize their users. CSPs provide their authentication mechanisms (e.g., AWS IAM) and have realized that they need to integrate with enterprise SSO. AWS offers SSO integration in conjunction with AWS Organization. SSO integrations
allow an enterprise to manage cloud accounts using enterprise credentials, which simplifies cloud management. For example, once an employee leaves the enterprise and their enterprise account is disabled, they automatically lose access to all their cloud accounts. Enabling and managing SSO between the enterprise and AWS are done via AWS Organization and must be centralized to utilize this capability. In addition, issues related to SSO failures must be addressed by a single team that is responsible for the CSP SSO setup.
Cloud Audit and Compliance Management Governance works when it can be enforced, and you cannot enforce it without being able to audit for compliance. While the CCoE defines policy and standards, it will need a team to conduct audits and make sure that enterprise accounts and users are following policy and standards. Most large enterprises have internal audit teams, but the cloud is relatively new, and audit teams are still in the process of understanding it. In addition, audits are considered overhead on the enterprise, as they take up a lot of time and effort. However, it is the only means by which you can measure if your governance framework is working. Typically, IT audits are conducted once a year, with sufficient time to address any identified issues. Cloud environments must be audited more frequently to prevent security and other issues such as cost overruns. Typically, auditors use a set of tools and processes while conducting audits. Having cloud operations assist with audits by automating data collection and interpreting logs speeds the process significantly.
Cloud Usage and Cost Management and Resource Optimization The cloud provides much more flexibility than traditional data centers, including elasticity and on-demand usage, but IaaS does not give you complete visibility into the underlying hardware or network. In the traditional data center model, the enterprise has complete control over the infrastructure and can troubleshoot the entire stack, including switches, routers, load-balancers, firewalls, network cards, and application layers. In the cloud, IaaS does not provide this flexibility. Further, if you are not using a dedicated host for your EC2 instance, you will share it with several other tenants of the CSP you are using. When deploying an application in the cloud, you have three options:
• Lift-and-shift: This approach does not involve any changes to the architecture itself; rather, it involves replacing components of your compute infrastructure. For example, you will replace your physical servers and storage with EC2 instances and elastic block storage (EBS) volumes. This enables you to access the cloud quickly, but your application is not optimized.
• Cloud-optimized: In the cloud-optimized approach, an existing application can benefit by adopting cloud services to replace existing architecture components. For example, you may replace your expensive on-prem MySQL cluster with AWS Relational Database Service (RDS), which reduces costs and improves performance.
• Cloud-native: Cloud-native applications are usually designed from scratch to use several services of the cloud. Existing applications will have to be entirely redesigned to use native cloud services. The applications fully benefit from on-demand usage, scalability, managed databases, monitoring, security, and cost.
In the traditional data center model, scaling an application vertically or horizontally involved detailed planning and days, weeks, or months of waiting for hardware and software upgrades. Applications were not usually designed to deal with these nuances. In the cloud, vertical and horizontal scaling can happen in a few minutes, but you need to design your application to utilize this capability. Without cloud best practices, all data center practices will trickle into the cloud. Hence, you will need architecture best practices to guide teams as they migrate to the cloud or design cloud-native applications. In the following sections, we will review some of the architectural best practices focusing on AWS.
Managing Failures Applications can fail for multiple reasons. It becomes the responsibility of the software engineering and DevOps or SRE teams to design an application to minimize failure and recover gracefully from a failure if it does happen. When designing applications for the data center, you have full visibility of the infrastructure constraints such as capacity, performance, and redundancy. Your application is designed to meet SLAs or OLAs, with all these constraints in mind. For example, your application may demand a very high SLA for the database, and hence the database infrastructure may be designed for 99.999% availability. Each component of your architecture is designed to meet the application's requirements, and therefore you have complete control. In the cloud, you don't own or operate the infrastructure and do not have full visibility. Further, you cannot define or control the SLA of any CSP. For example, AWS EC2 has a monthly SLA of 99.99% availability. With the SLA for EC2 and other services known, your application must be designed around the SLA constraints. A common approach to increasing availability is horizontal scaling. However, without understanding the cloud's infrastructure layout, the design may not deliver the expected SLA. In the traditional model, you
would horizontally scale by increasing the number of servers (web servers or application servers) and configure a load-balancer to balance traffic across these servers. You can accomplish the same in the cloud, but you can do it a lot more efficiently by automatically scaling up and down dynamically, driven by demand. CSPs provide multiple regions with AZs in each region. Understanding the nuances of using multiple regions and AZs will help make better decisions. For example, AWS has different costs for different regions, and there are constraints on RI purchasing that can lock you into a specific region. Without understanding and addressing these constraints, you will not meet your SLA all the time. For example, to increase availability, it is a common practice to run workloads across multiple regions. However, if you did not make RI reservations in these regions, there is no guarantee that your autoscaling will work during peak demand. Another example is with the RDS service, which does not back up or replicate the data across multiple regions by default. If you need to recover from a disaster, you have to make sure that the RDS instance is set up with multiregion replication.
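A minimal boto3 sketch of setting up RDS for higher availability and disaster recovery, as discussed above, follows: the first call provisions a Multi-AZ instance, and the second creates a cross-region read replica that can be promoted during a disaster. Identifiers, instance classes, regions, and the account ID in the ARN are placeholders, and a production deployment would also configure subnet groups, security groups, and secrets management.

```python
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")

# Provision the primary database with a synchronous standby in another AZ.
rds_primary.create_db_instance(
    DBInstanceIdentifier="orders-db",            # placeholder
    Engine="mysql",
    DBInstanceClass="db.m5.large",
    AllocatedStorage=100,
    MasterUsername="admin",
    MasterUserPassword="change-me-immediately",  # placeholder; use a secrets manager
    MultiAZ=True,                                # automatic failover within the region
    BackupRetentionPeriod=7,                     # automated backups are required for replicas
)

# Create a read replica in a second region for disaster recovery.
rds_dr = boto3.client("rds", region_name="us-west-2")
rds_dr.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica",
    SourceDBInstanceIdentifier=(
        "arn:aws:rds:us-east-1:111122223333:db:orders-db"  # placeholder account ID
    ),
    DBInstanceClass="db.m5.large",
)
```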
Managing Security Security is another area over which you had full control with traditional data centers. In most cases, enterprises have dedicated security teams responsible for protecting their applications from external and internal attacks. Your entire application stack was behind a firewall. In the cloud's shared-responsibility model, IaaS and PaaS security are usually the CSP's responsibility, and the enterprise is responsible for protecting the application. The cloud also provides hundreds of services that speed up application development, but each has its own security nuances to deal with. For example, you can offer downloads via Hypertext Transfer Protocol (HTTP) using an S3 bucket, but the S3 bucket must be configured and secured by the enterprise or else you risk exposing sensitive data. You need best practices to define how to secure your application and each of the enterprise's services. The cloud is a multitenant environment, so data must be protected at all times during transit and at rest. Enterprises will need an encryption strategy to ensure that sensitive data is always encrypted and that access is enforced and monitored. CSPs also provide distributed denial of service (DDoS) protection and web application firewall (WAF) services that need to be considered to protect all enterprise accounts.
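For the S3 example specifically, here is a small boto3 sketch of the kind of baseline hardening an enterprise might enforce: blocking public access at the bucket level and enabling default encryption at rest. The bucket name is a placeholder, and this is only a fragment of a full security baseline.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-enterprise-bucket-12345"   # placeholder

# Block every form of public access at the bucket level.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Encrypt all new objects at rest by default.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```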
Cost In the traditional model, the enterprise had a finite computing capacity. You knew beforehand how much volume you could handle before your application failed, as well as how much it cost to operate your application at full capacity. In the cloud, you consume resources dynamically and are billed
for what you consumed during a previous period, but you could potentially overconsume resources without realizing the cost up-front. For example, egress bandwidth is costly in the cloud, and rookie mistakes in application design can cost thousands of dollars in monthly charges. Without best practices on securing and monitoring egress access, cloud costs will add up quickly. All of these scenarios require architectural best practices that can be common across the enterprise. AWS provides the Well-Architected Framework, and other CSPs have their own recommendations.
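One way to get early visibility into charges such as egress is to query the AWS Cost Explorer API. The boto3 sketch below, an illustration rather than a complete FinOps solution, groups one month's unblended cost by service; the dates are placeholders, and Cost Explorer must be enabled on the payer account.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-01-01", "End": "2022-02-01"},  # placeholder dates
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

# Print the cost per AWS service for the period, highest first.
groups = response["ResultsByTime"][0]["Groups"]
for group in sorted(
    groups,
    key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
    reverse=True,
):
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f'{group["Keys"][0]}: ${amount:,.2f}')
```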
Cloud-Operational Best Practices Like architectural best practices, when designing cloud applications, you have to follow best practices in order for operational teams to manage the application efficiently. In the traditional model, the enterprise provided standardized services for logging and monitoring. However, in the cloud, you have the option of using the CSP-provided service or creating your own solution using open-source or commercial software. If each application team chooses a different method, logs will be distributed across multiple systems, especially security logs, which will be required for forensics after a breach. If your enterprise has a central security team, it will get overwhelmed by using multiple solutions. Identity and access management is another area that requires best practices. CSPs provide multiple mechanisms to manage identity and access control and leave the decision to the enterprise. Using a single mechanism such as enterprise SSO to access the cloud is one of the best practices.
Cloud Connectivity Management You can have applications that are entirely deployed in the cloud with no external dependencies, or applications that must connect back to the enterprise. In the first case, there is no need for a secure network connection from the enterprise to the CSP. In a hybrid environment, an enterprise may choose to keep a critical application such as ERP in its data center and have other applications run in the cloud. In this situation, you will need a secure connection from the cloud back to your enterprise data center. As easy as this sounds, you will need to plan for redundancy and high availability of these connections to multiple CSPs in a multicloud environment. You have to manage the security and balance capacity requirements of these network connections. In addition, you will require governance and internal chargebacks for functions using these connections. Finally, you have to monitor and troubleshoot issues with these connections, as a down link could affect the SLA or OLA of your cloud applications.
CSP and Tool Selection In a multicloud environment, for every new application that needs to migrate or get built in the cloud, a decision must be made as to which CSP is the best fit. While all CSPs offer basic computing resources (compute, storage, network), some CSPs differentiate themselves with their offerings and services. For example, AWS has the broadest portfolio of services, making it easier for software development teams to build applications. Google offers premium network services utilizing its global network backbone, making it attractive for enterprises that require such connectivity. Azure integrates with enterprises that utilize Microsoft solutions such as Office 365. Apart from the technical advantages of each CSP, the enterprise has to look at strategic partnerships with CSPs and address customer demands, which are two other dimensions to consider when selecting a CSP. For example, an enterprise may want to enter into a partnership agreement with Azure to integrate and sell a solution that works on Office 365. This would require that the application run on Azure, and hence Azure might become the preferred CSP. Enterprise customers may also demand that a specific service be hosted on the preferred CSP for multiple reasons, including security, performance, cost, or reduced complexity. This could drive an enterprise to introduce multiple CSPs to accommodate its customers' requirements.

With regard to tools selection, it is rare for an enterprise to utilize all the services provided by a CSP. For example, an enterprise may choose to complement the CSP's services with external tools or technology. CSPs offer public cloud marketplaces that enable enterprises to search, select, purchase, and deploy third-party tools within minutes. In such a situation, having multiple third-party solutions without any governance can create overhead for operations teams, such as the security and audit teams. For example, if an enterprise utilized multiple third-party logging solutions, it would be challenging for central security and audit teams to use various monitoring and auditing sources. The enterprise will need to have a preferred supplier list for third-party services that the teams can choose from, which limits the proliferation of tools.
Workload Distribution Workload distribution across CSPs is another strategy that enterprises need to consider when using multiple clouds. The enterprise must establish firm guidance, based on use cases, about when to use one CSP over another. These cases could be based on business considerations (e.g., selling on the CSP’s public cloud marketplace) and technical benefits (e.g., better security and compliance services). Without clear guidelines, workloads will be distributed arbitrarily across multiple CSPs based on the developer’s preference,
knowledge, and familiarity with the CSP services rather than business objectives. In some cases, CSPs promote their services and influence decisions by directly interacting with product and engineering teams. Lack of a workload placement strategy and guidance will result in application and data integration issues due to increased operational complexity. Described next are some factors that influence the workload placement use cases:
• Functionality: All CSPs offer basic compute services; they differentiate themselves by providing specific capabilities that may help an enterprise innovate faster or gain an edge over the competition. For example, if you have a streaming media application that would benefit from a global low-latency, high-speed network that can interconnect several locations, GCP would be a better fit. GCP is currently the only CSP to offer a premium tier option for networks. The premium tier unlocks GCP's high-performance and low-latency global network. Traffic is prioritized and routed through the fewest hops via the fastest paths to accelerate transport speeds and increase security. All network-intensive applications can benefit from GCP's premium tier network service, and it is a good use case for functionality-based workload placement.
• Locations: The enterprise is continuously looking to enter new markets and select new options to deliver services within a geographic area for latency reasons or data sovereignty requirements. In this use case, a decision to use a specific CSP would depend on the locations at which the services are available. For example, if your business wants to tap into the South American, Australian, or African market, Azure has more locations in these parts of the world than AWS and GCP.
• Manageability: It takes a combination of tools, technologies, and expertise to manage an application stack. There is no consistency in features and functionality across CSPs, making it highly complex to manage an application running in multiple clouds. Manageability also depends on the skills and knowledge related to a specific CSP's environment and the quality of its support and professional services organization. These challenges limit an enterprise to a single CSP for larger and complex applications, with smaller applications distributed across other CSPs.
• Data placement: If your enterprise workloads share data across multiple systems, all the related workloads must reside in the same cloud to avoid excessive data egress charges, latency, and data security issues.
• Direct cost: Overall cost is an essential factor to consider when adopting a cloud, but cost-based workload placement should not be the primary factor. Cost-based workload placement could reduce monthly IaaS and PaaS costs in the short term, but the operational complexity may increase in the long term. Hence, it is vital to calculate cloud TCO and compare all other factors before using cost as a primary factor.
Given multiple business and technical factors that need to be considered before defining a strategy for workload placement, it is essential to have a strategy that works for all the various functions that will use the cloud. The factors described here may not apply equally to all the functions, and hence the workload placement strategy must accommodate exceptions. For example, there are significant benefits of operating Oracle-specific workloads in Oracle Cloud Infrastructure. You must have an exception process to accommodate teams that require such a capability. Understanding enterprise-wide requirements and defining use cases for workload placement across multiple clouds will reduce overall operational complexity.
Skills and Training For successful adoption and cloud usage, you need to have relevant skills, knowledge, and expertise with the cloud. Without investing in continuous learning and training, it is hard to keep up with the CSP’s innovation. The gap in knowledge increases significantly when an enterprise decides to use multiple CSPs. The traditional data center model has been around for decades, and enterprises and industry have invested heavily in training and hiring personnel who are well versed in data centers, infrastructure, security, network management, and ITOps. The cloud is still relatively new compared to traditional data centers. It requires different sets of skills, requiring additional training and hiring new personnel to address the gaps. For example, here are some areas for training:
• Sourcing and procurement teams will need to upgrade their skills and understand how IaaS, PaaS, and SaaS work and how to negotiate contracts for public clouds.
• Legal teams have to understand how security, compliance, country, and government-specific laws work when you use public clouds.
• Finance teams are typically used to a CapEx model, which did not change much on a monthly or quarterly basis. Capacity was added using capital and depreciated over multiple years. The cloud is a mix of CapEx and pay-per-use, and finance teams need to understand how to plan and allocate monthly and quarterly expenses and how to handle dynamic capacity management and cost allocation.
• Security and DevOps teams have to deal with a model where they don't have complete control over underlying infrastructure or services. Yet they have to deal with the same degree of risk and shared responsibility as the CSP. While most security and operational principles remain the same, these teams need to adapt them to the cloud.
Management and leadership teams need to understand the benefits, risks, and challenges of using the cloud and adjust their business requirements accordingly. They need to understand the cloud’s common misconceptions and make informed decisions, consulting with subject matter experts whenever required.
Migration Assessment and Guidance Migrating applications from traditional data centers requires thorough assessment and planning. The assessment process should evaluate applications to see if there are benefits to migrating to the cloud and should eliminate applications that do not need to migrate to the cloud, thus saving time and effort. It’s best to define a basic set of criteria in a migration strategy document that can be used to identify applications that should be migrated. This approach will remove ambiguity across the enterprise and provide guidelines across functions. CSP management. Imagine a situation where you have three CSPs that your enterprise uses, and you have several hundred accounts across each CSP. Given this scenario, you must decide who is responsible for the following:
• Tracking and reporting on CSP performance
• Change management when CSPs make changes to their policies or terms and conditions
• Making sure that credits, discounts, RI purchases, and savings plans are applied as expected and resolving discrepancies with invoices
• Monthly invoice validation, processing, and chargeback
• Escalating enterprise-specific priorities to the CSP and making sure that they get addressed
Each function could own this at a functional level, but there will be a lot of redundancy and wasted time and effort. Or the enterprise can centralize this
within cloud operations and make sure that cloud operations are responsible for meeting all the functional requirements and escalations. Typically, in a larger organization, cloud operations would host a quarterly business review with each of the CSPs and bring together all their functions to review performance and address concerns. This model is more efficient and reduces the cross-talk between multiple functions and CSPs. Shared services. The cloud is entirely programmable—that is, you can write code to automate the provisioning of infrastructure, configure and customize it, and deploy your entire application stack as code. Code can be shared and reused with the enterprise, making it easier to share knowledge and skills. However, it requires shared tools such as source code repository, wikis, and case management tools, which the developers, DevOps, SRE, and cloud operations teams can use to share templates, knowledge, and best practices. Shared services also include all the central services required for a CSP to be managed, including identity and IAM, account life cycle management, procurement services, security and compliance, and auditing. A central team should take the responsibility to provide this service to the rest of the functions.
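As a small, assumed illustration of this "everything as code" idea, the boto3 sketch below deploys a CloudFormation stack from a template that could live in a shared Git repository and be reused across accounts. The template, stack name, and tag values are placeholders.

```python
import json
import boto3

cfn = boto3.client("cloudformation")

# A trivial template; real templates would live in a shared Git repository.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "LogsBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "example-team-logs-12345"},  # placeholder
        }
    },
}

cfn.create_stack(
    StackName="team-baseline",
    TemplateBody=json.dumps(template),
    Tags=[{"Key": "cost-center", "Value": "bu-1234"}],
)

# Block until provisioning finishes (raises if the stack rolls back).
cfn.get_waiter("stack_create_complete").wait(StackName="team-baseline")
```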
Cloud Usage and Cost Management
Enterprises using the traditional data center model have dedicated teams that manage IT CapEx and OpEx budgets. In addition, procurement is responsible for negotiating contracts and purchasing hardware, software, and services, and accounts payable teams take care of the payment of invoices. Well-defined policies and standards govern this entire process. IT teams are responsible for planning and overseeing IT budgets and funding initiatives. In other functions such as business units, a similar model exists where the business operations or finance team is responsible for managing and reporting on the quarterly or annual budgets.
The cloud's benefits, such as the flexible and variable consumption model, completely change how budgets get managed. IT planners, finance, and procurement no longer have control of critical decisions such as usage and cost; they often do not understand the cloud model and grapple with managing the cost of using the cloud. This has created a need for a cloud-specific service that focuses on addressing cloud usage and cost gaps across the enterprise. Cloud usage and cost management is a specialized service that focuses on managing and optimizing cloud resource usage and cost. This service is also known as the FinOps practice.
Cloud adoption across an enterprise introduces new challenges, and managing cloud spending is among the most difficult of these. In the data center model, you had fixed capacity, and hence set costs. However, in the cloud model, cost varies per hour depending on several factors, such as the size of the resources, location, and the number of resources consumed every hour. This variability makes it extremely hard for finance teams to forecast and budget, making quarterly planning cycles complex and challenging. Further, in a hybrid model where parts of the application still reside in a data center, calculating TCO for a specific application involves set (data center) and variable (cloud) costs. On the positive side, the cloud provides granular visibility into computing costs, showing resource utilization and productivity by the hour. This presents the finance and operations teams with an opportunity to continually monitor and increase resource productivity through optimization. Public cloud usage and cost management are complex because of the following factors:
• Complex pricing structures: There are hundreds of types of resources and several ways that a resource can be provisioned and consumed, making it complex to understand pricing and billing models for each combination of resources. For example, AWS EC2 has several instance types, each with a different size, and all of these can be discounted in multiple ways, with several payment options. We will cover EC2 in detail later in this section to show how cost optimization works.
• Organizational structure: Enterprises can use a centralized or decentralized approach toward managing their cloud, providing a single view or a siloed view of usage and cost. Some functions may be more aggressive in controlling usage and costs, and others may not. You get more savings and lower risk when optimizations are managed centrally, as the enterprise benefits from several techniques that we will discuss further in this chapter.
• Granularity of cloud bills: Daily and monthly charges can range from a few hundred rows to millions of rows when using consolidated billing. This can be intimidating for finance teams trying to create cost models and process all this data (see the short sketch after this list for one way to summarize it programmatically).
• Lack of provisioning constraints: CSPs make it extremely easy to provision and consume resources in the cloud. You can automate the provisioning of resources, which is a huge benefit to the enterprise, but mistakes such as accidental provisioning because of a bug in the code can be costly.
• Continuous changes: CSPs are continually competing with each other and innovating, adding new features frequently. Enterprises are
not used to this level of change from a vendor, struggle to keep up with it, and don't have a clear view of how it affects their costs.
• Architecture flexibility: Applications can be deployed using various services and architectures, and the cost varies depending on the deployed architecture. Finding the right balance between performance and costs can create friction between finance and engineering teams.
• Multiple usage and cost management platforms: Both CSPs and third parties provide cloud management platforms that can be utilized to manage usage and costs. CSPs focus on their own clouds and tend to have an advantage (e.g., granular analysis of spend and cost) over third-party platforms, and third parties usually support multiple clouds. However, they generally support the top two or three clouds and suffer from a lack of feature parity across clouds. For example, third parties may support AWS, Azure, and GCP, but featurewise, they may have full coverage for AWS but lagging coverage for Azure and GCP.
• Lack of standardization: There are no billing or cost management standards currently being followed by CSPs. Hence, enterprises that utilize multiple clouds will have to repeat processes for each cloud.
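As one illustrative way to cut through the granularity of a large bill, the following minimal Python/boto3 sketch pulls one month of cost from the AWS Cost Explorer API grouped by service; the dates and metric choice are assumptions, not a prescribed reporting approach.

```python
# Minimal sketch: month-to-date cost by service via the AWS Cost Explorer API.
# Dates, granularity, and the grouping dimension are illustrative choices.
import boto3

ce = boto3.client("ce", region_name="us-east-1")

result = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-01-01", "End": "2022-01-31"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in result["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```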
Cloud Usage and Cost Management Strategy Framework
Cloud usage and cost management are contentious areas within larger enterprises that have multiple functions using the cloud. This stems from the fact that usage and cost have different priorities across functions. For example, a business unit could be adopting the cloud to reduce cost, whereas another business unit values speed and scalability over cost. Having one of the functions be responsible for enterprise-wide cost management will create friction. Enterprises can take a different approach to avoid friction and meet their business objectives by separating usage and cost management responsibilities into two areas:
• Monitoring and reporting: This area focuses on achieving the business outcomes related to usage and cost. For example, an enterprise may mandate that cloud usage and cost reduce current costs by a certain percentage. Alternatively, the enterprise might decide that their multicloud strategy includes a certain share of their spending with a specific CSP. This must be the responsibility of a central organization.
• Optimization: This area is more operationally focused on actually managing and optimizing usage and cost. This must be the responsibility of each function.
By separating usage and cost management into these two areas, the enterprise can drive strategy and centralize governance without friction. To be successful in cloud usage and cost management, the enterprise needs to define the usage and cost management strategy up-front in the enterprise-wide cloud strategy.
Usage and Cost Management Life Cycle
Smaller enterprises with few accounts can take a tactical approach by starting the optimizations immediately. However, larger enterprises with multiple CSPs and hundreds of accounts will need to take a life cycle approach. The usage and cost management life cycle is broadly separated into three phases:
• Planning and monitoring: This phase should start during the design of a new application and involves requirements gathering for the workload, architecture review, understanding the usage and cost forecast, and setting up monitoring. This phase's primary objective is to make sure the workload meets all usage and cost objectives from day one.
• Optimizations: This phase generally starts after the workload is operational and monitoring is set up. Once you have sufficient data to create a baseline usage and cost for the workload, you can begin optimizing the workload. Typically, optimization would look for unused resources such as EC2 instances, storage, and Internet Protocol (IP) addresses, apply discounts to the most-used resources using RIs and savings plans, and make architectural changes to optimize resource use (a short sketch after this list shows a simple unused-resource sweep). We will cover this later in this chapter when we review AWS EC2 in detail.
• Enforcement and reporting: This is the governance phase and will start as soon as you have completed a single cycle of planning, monitoring, and optimization. This phase aims to make sure that all workloads follow the governance framework (policies and standards) and that stakeholders receive usage and cost reports periodically, including the reporting of deviations.
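The following minimal Python/boto3 sketch illustrates the kind of unused-resource sweep an optimization pass might start with; the focus on unattached EBS volumes and unassociated Elastic IPs is an assumption, and a real pass would cover many more resource types.

```python
# Minimal sketch: flag unattached EBS volumes and unassociated Elastic IPs,
# two common sources of waste found during the optimization phase.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Volumes with status "available" are not attached to any instance.
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]
for vol in volumes:
    print(f"Unattached volume: {vol['VolumeId']} ({vol['Size']} GiB)")

# Elastic IPs without an association are billed while sitting idle.
for addr in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in addr:
        print(f"Unassociated Elastic IP: {addr['PublicIp']}")
```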
Cloud usage and cost optimization is not just about savings. If done correctly, it can improve your application stack's overall performance and strengthen security and compliance. CSPs offer several services and encourage utilizing cost-saving best practices. Cost saving is complex, and there are dedicated books on this topic. The next section will explain how you can apply cost-saving strategies to the AWS EC2 service. Before we discuss these strategies, we will cover relevant information related to cost management of EC2.
AWS Reserved Instances
AWS RIs have several options, which can make them harder to understand. In simple terms, an RI is a discount purchased for using a specific EC2 instance; it gets applied to the hourly rate when you utilize that specific EC2 instance. In addition, in particular situations, RIs also provide capacity reservations for instances within a specific zone or region of AWS. The discount can vary depending on several attributes of the EC2 instance and the term (one year versus three years). The EC2 instance has to match the attributes in order for the discount to apply. Enterprises looking to save money on their cloud usage must familiarize themselves with RIs, which are the most significant cost-saving tool that AWS provides. When used correctly, RIs can give considerable savings; when used incorrectly, they can create waste. We will discuss AWS RIs in this section, but GCP has a similar concept called committed use discounts (CUDs), and Azure calls its discount Azure Reserved Virtual Machine Instances. CUDs and Azure Reserved Virtual Machine Instances work differently compared to AWS RIs, but the overall concept of using discounts to lower the hourly rate of on-demand usage remains the same. RIs provide the following benefits:
• Cost savings: This is probably the most prominent reason why enterprises use RIs. RIs can deliver significant savings (up to 72%) on the hourly rate of on-demand instances when planned and managed carefully. Apart from EC2 instances, RIs are available for all database (DB) engines supported by AWS, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and Microsoft SQL Server.
• Capacity reservation: Capacity reservation guarantees that when you need EC2 capacity, it will be available. AWS and other CSPs distribute their data centers across multiple locations globally. For example, AWS has data centers in Virginia and Ohio on the East Coast of the United States. These are called regions, and each region is subdivided into multiple AZs. AZs within the same region are connected through low-latency links, but each has its own power and cooling systems and operates independently of other AZs. When reserving EC2 instances, AWS allows you to select reservations at the regional or AZ level. RIs purchased at the AZ level come with a capacity guarantee, which is especially useful if your application is designed to autoscale during peak traffic. Suppose that you have not reserved capacity in your AZ in advance. In that case, there is no guarantee that EC2 instances will be available, resulting in autoscaling failures and performance degradation of your application.
• Disaster recovery (DR): Although CSPs take precautions against a specific region's total shutdown, natural disasters can result in one of the regions going offline. Purchasing RIs in another region, such as the West Coast of the United States, provides the ability to quickly fail over to that region and continue operating your applications.
Now that you understand the benefits of RIs, let's focus on the basics of RIs, including RI types, payment options, attributes, and terms (one and three years), so that you can leverage RIs to the fullest. This information is what you will need to create an RI strategy and manage cost savings using RIs.
Basics of AWS Reserved Instances
AWS offers two types of RIs:
Standard: These RIs can be purchased with a one- or three-year commitment. They are best suited for steady-state usage and when you have a good understanding of your long-term requirements. They provide up to 72% in savings compared to on-demand instances.
Convertible: These RIs can be purchased only with a three-year commitment. Unlike standard RIs, convertible RIs provide more flexibility and allow you to change the instance family, instance type, platform, scope, or tenancy. Because of this flexibility, they provide a smaller discount (up to 66%) off on-demand rates.
AWS RIs can be bought using three payment options. The choice you make will be based on how you account for the RIs when you calculate the COGS (a small worked example follows this list):
1. No up-front: You don't incur any up-front costs (i.e., you don't pay until after you have used the resource). You benefit from the discounted hourly rate within the term regardless of usage. So if you made a reservation using this option and did not use the instance, you will still get billed for it.
2. Partial up-front: You pay a partial amount in advance, and the remaining amount is billed at a discounted hourly rate per month. Because you are paying a partial amount up-front, you get a better discount rate than with no up-front but still less than with all up-front.
3. All up-front: You pay the entire amount up-front, and the discounts start applying to any instance that matches the attributes of the RI after the payment. This option provides the maximum percentage of the discount, which can go up to 72%.
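To make the trade-off between the payment options tangible, here is a small, illustrative calculation over a one-year term; all prices are invented for the example and are not actual AWS rates.

```python
# Illustrative comparison of RI payment options for one instance over a
# one-year term. All prices are made-up numbers, not actual AWS rates.
HOURS_PER_YEAR = 8760

options = {
    # label: (up-front payment, effective hourly rate)
    "on-demand":        (0.0,   0.100),
    "no up-front":      (0.0,   0.065),
    "partial up-front": (250.0, 0.035),
    "all up-front":     (500.0, 0.000),
}

on_demand_total = options["on-demand"][1] * HOURS_PER_YEAR
for label, (upfront, hourly) in options.items():
    total = upfront + hourly * HOURS_PER_YEAR
    savings = 100 * (1 - total / on_demand_total)
    print(f"{label:>16}: ${total:8.2f}/year  ({savings:4.1f}% vs on-demand)")
```

With these hypothetical rates, the annual cost drops as more is paid up-front, which mirrors the real ordering of the three options even though the exact percentages will differ.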
Reserved Instance Attributes
Four attributes affect AWS RI pricing. AWS uses these attributes to match your RI purchase to instance usage:
• Instance type: The instance type offers different compute, memory, and storage capabilities and is grouped in instance families based on these capabilities. There are hundreds of instance types in AWS, so you have to be careful when you make your selection.
• Platform: The choice of platform includes Linux/UNIX, SUSE Linux, Red Hat Enterprise Linux, Microsoft Windows Server, and SQL Server.
• Scope: This is about whether the RI applies to a region or a specific AZ.
• Tenancy: This can be either default (shared) or dedicated.
Regional and Zonal Reserved Instances (Scope)
You can purchase an RI with a regional or AZ scope:
• Regional: When you purchase an RI for a region, it's referred to as a regional RI.
• Zonal: When you purchase an RI for a specific AZ, it's referred to as a zonal RI.
Reserve capacity. A regional RI does not reserve capacity, while a zonal RI reserves capacity in the specified AZ. This is important if you want to have guaranteed capacity for autoscaling and also for DR.
AZ flexibility. The regional RI discount applies to instance usage in any AZ within the specified region. For example, if you purchased an RI in the northern Virginia region, the RI discount will apply to any instances matching the criteria in any of the AZs within the Virginia region. The zonal RI discount applies to instance usage only in the specified AZ. For example, if you purchased an RI in us-east-1b, which is an AZ within the northern Virginia region, then the discount applies only to instances running in the us-east-1b zone.
Instance-size flexibility. The regional RI discount applies to instance usage within the instance family, regardless of size. Instance-size flexibility is supported only on Amazon Linux/UNIX RIs with default tenancy. The zonal RI discount applies to instance usage for the specified instance type and size only.
RI purchase queuing. Customers can queue purchases of regional EC2 RIs in advance by specifying a time of their choosing in the future to execute those purchases. By queuing purchases, customers can enjoy uninterrupted RI coverage, making regional RIs even easier to use. By scheduling purchases
in the future, customers can conveniently renew their expiring RIs and plan for future events. You cannot queue purchases for zonal RIs.
Term Commitment
You can purchase an RI for a one- or three-year commitment, with the three-year commitment offering a more significant discount:
• One-year: A year is defined as 31,536,000 seconds (365 days).
• Three-year: Three years is defined as 94,608,000 seconds (1,095 days).
There is no option to renew RIs automatically. Hence, you need to track their expiration dates and manually renew them before expiration. If they expire, the discount will no longer be applied, and you will get charged the on-demand rate.
Purchasing Limits for Reserved Instances
There is a limit to the number of RIs that you can purchase per month. For each region, you can purchase 20 regional RIs per month, plus an additional 20 zonal RIs per month for each AZ. For example, in a region with four AZs, the limit is 100 RIs per month: 20 regional RIs for the region plus 20 zonal RIs for each of the four AZs (20 × 4 = 80).
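The limit is simple arithmetic; the following one-liner just restates the calculation above as code for teams that script their purchase planning.

```python
# Illustrative restatement of the per-region monthly RI purchase limit
# described above: 20 regional RIs plus 20 zonal RIs per Availability Zone.
def monthly_ri_purchase_limit(az_count: int) -> int:
    regional_limit = 20
    zonal_limit = 20 * az_count
    return regional_limit + zonal_limit

print(monthly_ri_purchase_limit(az_count=4))  # 20 + 80 = 100
```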
Savings Plans
RIs have been and continue to be an excellent way to save on on-demand usage. RIs were first introduced when there were five EC2 instance sizes and three regions, with all up-front payments for one year or three years. RI management was easy back then, but today there are 200+ EC2 instance sizes, 22+ regions, and several attributes to consider when purchasing RIs. All of this makes RI management complex and challenging across a large enterprise. Keeping this in mind, AWS launched savings plans, another discount program that allows customers to make a per-hour usage commitment; in return, AWS provides discounts. Savings plans apply to EC2, Fargate container services, and AWS Lambda serverless computing. Any consumption above the committed per-hour amount is charged at the on-demand rate. Savings plans offer more flexibility with fewer purchasing options, making them simpler and easier to manage. Similar to RI attributes, savings plans have a few attributes:
• The type of plan: EC2 Instance savings plan or Compute savings plan
• The term of the commitment: one year or three years
• Payment for the commitment: all up-front, partial up-front, or no up-front, as described for RIs
• The region (EC2 Instance savings plans only)
• The instance family (EC2 Instance savings plans only)
There are two types of savings plans that give customers a choice between maximizing the financial benefits of the discount program (by sacrificing flexibility) and maximizing flexibility while benefiting from a smaller discount. In this respect, they are similar to standard RIs and convertible RIs. It’s no coincidence that the maximum savings plan discounts match those of the RI discount program.
EC2 Instance Savings Plans
EC2 Instance savings plans can offer savings of up to 72% compared to on-demand rates, depending on the term of the commitment, payment option, instance family, and region. This is similar to the purchase of standard RIs and locks you into the instance family and region. For example, you can purchase a specific instance family (e.g., M5) in a specific region (e.g., northern Virginia) and can change instance size, operating system, and AZ, so long as you are using an M5 instance in northern Virginia.
Compute Savings Plans
Compute savings plans provide the most flexibility and help to reduce your costs by up to 66%. They are similar to convertible RIs. For example, you can shift from C4 to M5 instances, shift a workload from one region to another, or migrate from EC2 to Fargate or Lambda at any time. None of these actions interrupt the coverage or pricing of your AWS savings plan. The advantages of AWS savings plans over RIs include the following:
• Unlike RIs, savings plans automatically benefit from price adjustments that occur during the commitment.
• AWS savings plans don't lock you into certain instance types like RIs, which means that you can make changes within an instance family to be more cost-efficient.
• Savings plans are simpler to transact and manage compared with RIs.
• Compute savings plans provide discounts without committing you to any specific instance type.
• AWS savings plans allow you to flexibly transfer workloads between instance types, sizes, and generations to meet changing demands and architectures.
AWS savings plans also have some disadvantages, including the following:
• AWS savings plans can’t be purchased for relational database • • • •
service (RDS), Redshift, and other services. AWS savings plans don’t offer opportunities to resell or offload underutilized commitments. AWS savings plans charge on-demand prices for utilization that is not covered by the savings plan. AWS savings plans don’t provide capacity reservations. AWS savings plans don’t often provide better discounts than RIs.
There are several advantages of AWS RIs over AWS savings plans:
• RIs for shorter terms can be purchased on the RI Marketplace.
• RIs can include discounts for RDS, as well as EC2 (but not Fargate).
• RIs can provide some of the largest discounts on the higher end (60% or more) in the case of some three-year up-front terms.
Cost Management Strategy
In the previous section, we described the various options for saving when using the AWS EC2 service. EC2 is one of the most popular services in AWS. Apart from private pricing agreement (PPA) discounts (discussed in Chapter 3), RIs and savings plans are additional means to get discounts on EC2 and a few other services. AWS EC2 is one of the 200+ products and services offered by AWS, so you can imagine the complexity of understanding the various savings options for each of them. The general best practice is to optimize the cost of the top three to five services that the enterprise consumes. This list usually includes EC2, S3 storage, network transfers, RDS, and CloudFront. It could vary across enterprises, but these services are among the most consumed. Between EC2 RIs and savings plans, you have a lot of flexibility to manage your savings, but it still requires a lot of advance planning and coordination across the enterprise to maximize savings. Generally, when you consider purchasing RIs and savings plans, you must decide if you want to buy at each AWS account level or at the enterprise-wide level. There are advantages and disadvantages to both strategies. For example, if you buy at the account level, the following advantages and disadvantages apply.
Advantages:
• You have complete control over how many RIs you want to purchase and the types of RIs and savings plans you must buy to maximize discounts and minimize waste.
• You have complete control over your RIs and savings plans and can exchange one class for another or sell unused RIs in the RI marketplace.
• You have full visibility into the usage and cost of RIs and savings plans, which makes it easier to manage costs.
• You can choose to share RIs with other accounts to minimize the waste of unused RIs (this works only with consolidated billing enabled).
Disadvantages:
• Buying and managing RIs and savings plans at each account level can be complex and require routine management.
• Having to deal with unused RIs, while selling them is an advantage, can be a problem if you have to do so on a per-account basis.
• There is no marketplace to sell unused savings plans. If you end up overbuying, you will lose your investment.
In general, the best practice is to set up consolidated billing and purchase RIs at the account level if you are confident of current and future requirements. Enabling “float” allows unused RIs to be shared across other accounts. With savings plans, you can make a much larger per-hour commitment for the entire enterprise if purchased and managed centrally.
Best Practices for Purchasing RIs and Savings Plans
There is no definitive answer or strategy that works for every enterprise. It will vary by enterprise, but here are some general best practices:
• RIs apply to EC2, Redshift, RDS, and ElastiCache, while savings plans cover only EC2, Fargate, and Lambda. Hence, purchase RIs to cover multiple services and use savings plans for what is left over without RI coverage.
• If you are looking for a balance between flexibility and cost savings, savings plans are the best option, as they offer a lot more flexibility.
• RIs can be sold and purchased on the RI marketplace. There is no marketplace to sell or buy savings plans.
• Buy RIs when you can predict your workload requirements and savings plans when you cannot (the sketch after this list shows one simple way to size a savings plan commitment from historical hourly spend).
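One common sizing approach, sketched below under the assumption that you have already exported hourly on-demand spend (for example, from Cost Explorer), is to commit to roughly the steady-state floor of that spend; the sample data and the first-quartile choice are illustrative policy decisions, not AWS guidance.

```python
# Illustrative sizing of an hourly savings plan commitment from historical
# hourly on-demand spend. The sample data and the first-quartile choice are
# assumptions; a real analysis would use exported Cost Explorer data.
import statistics

hourly_spend = [12.4, 11.8, 13.0, 25.6, 30.2, 12.1, 11.9, 12.6, 28.4, 12.2]

# Commit to roughly the steady-state floor so the commitment is always used;
# spikes above it are simply billed at on-demand rates.
baseline = statistics.quantiles(hourly_spend, n=4)[0]  # first quartile
print(f"Suggested hourly commitment: ${baseline:.2f}")
```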
Cost Management Tools
Whether you are just starting out using the cloud or have been using it for a while, cloud cost management is one of the top challenges with cloud management. Cloud cost management tools help with cloud spending management, purchasing reserved capacity to optimize spending, and driving cultural
changes across your organization by providing visualization of cost and spend data and historical usage data for consumers.
When it comes to tracking and reporting on usage, making purchasing decisions, and the ongoing management of your RIs and savings plans, you have three choices of tools:
CSP-provided tools: Each CSP has a built-in console to view your costs on a monthly or annual basis. For instance, Cost Explorer is the built-in tool in AWS that helps you explore your costs. CSP-provided tools have the unique advantage of directly accessing customers' billing data, and they can be set up and configured quickly. They are excellent for starting out and working at a small scale, but teams often struggle to use them as the single source of information and often need to use additional tools to supplement them. CSP tools are primarily helpful for finance teams who need a high-level view of their costs and opportunities to save money across their whole cloud infrastructure. They support only their own cloud environment. The advantages are as follows:
• Good for small-scale usage of the cloud, with a simple cost structure
• Provide a high-level view of how costs are distributed across different cloud services (compute, storage, network)
• Provide recommendations for saving and optimization
• Allow you to set budgeting alerts
• Free for customers
The disadvantages are as follows:
• Require extensive tagging of resources to get granular cost visibility
• May not be able to track idle or unallocated resource costs
• Difficult to rely on exclusively as the organization scales, even for high-level finance reporting
• Not designed to be proactive; you can see cost reports only after the fact (with some ability to create warnings)
Third-party tools: Several third-party tools from leading independent software vendors (ISVs) are excellent alternatives to CSP-provided tools. These providers support multiple clouds and have a lot more features and functionality. If your enterprise uses multiple clouds, using a third-party tool is your only choice. Third parties generally prioritize rolling out new features for the most popular
cloud. Although they support multiple clouds, there is a delay in rolling out features to the second and third clouds. In most cases, AWS, GCP, and Azure clouds are supported, with some ISVs also supporting data centers.
Do-it-yourself tools: There is always the option of building these capabilities within the enterprise yourself. CSPs provide hourly, daily, and monthly downloads of their billing data, which the CSP and third parties use as the foundational data for their cost management platforms. Managing the cost data in spreadsheets and databases works for smaller organizations with only a few accounts. Some larger organizations have invested in building out a dedicated team that can manage cloud costs. Doing this yourself gives you granular control over your data, and you can quickly produce customized reports. However, it comes at an additional cost of having to hire and maintain a team. If you don't plan carefully, your cost management team could cost more than what your optimizations save you.
Cloud cost management is an ongoing, continuous process that will require extensive access to and visibility of data. Regardless of which options your enterprise chooses to manage cloud costs, you will need a mechanism to plan, track, optimize, and report on your cloud usage and cost savings. If you don't manage cloud costs, your cloud TCO may end up being higher than that for the traditional data center.
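As an illustration of the kind of guardrail these tools provide, the following minimal Python/boto3 sketch creates a monthly AWS Budgets cost budget with an 80% alert; the account ID, amount, and e-mail address are placeholders.

```python
# Minimal sketch: a monthly cost budget with an 80% alert via AWS Budgets.
# The account ID, amount, and e-mail address are placeholders.
import boto3

budgets = boto3.client("budgets", region_name="us-east-1")

budgets.create_budget(
    AccountId="111122223333",  # hypothetical payer account
    Budget={
        "BudgetName": "monthly-cloud-spend",
        "BudgetLimit": {"Amount": "50000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "[email protected]"}],
    }],
)
```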
Cloud Migration
Cloud migration is the process of moving enterprise applications and data from an existing environment to the cloud. Enterprises operating data centers or running in colocation facilities have invested in virtualization, containerization, and private cloud technologies. Hence cloud migration does not just involve migrating from physical, on-prem data centers; it also includes migrating from one cloud to another and from virtualized and containerized environments to the cloud. Cloud migration offers all the benefits that the cloud provides: shifting from CapEx to OpEx, elasticity and scalability, agility and flexibility, cost savings and productivity, performance, reliability, resiliency, security, and compliance. However, migrating to the cloud requires a strategy for evaluating applications to see if they would benefit from using the cloud. Before an enterprise embarks on its cloud migration journey, it is crucial to identify these steps:
• Understand the business objective of migrating to the cloud: Enterprises must have clearly defined business drivers that lead them to migrate to the cloud. The cloud strategy document should define these drivers, and everyone involved with the migration process should be aware of the business objectives. Any misalignment across the teams will fail to deliver the business outcomes that the enterprise is expecting. For example, if lowering costs is one of the desired results, and the cloud migration does not do this, the business objective is not achieved.
• Devise a strategy for classifying applications and data that will be moving: Unless there is a mandate to migrate an entire data center, including all applications and data, enterprises will have to be selective when migrating applications to the cloud. Some applications will benefit from migrating to the cloud, and others will not, so there must be a strategy for cloud migration. Gartner has defined the "5 R's framework," which several other vendors have adopted as a cloud migration strategy. In addition to this framework, the enterprise must consider if it wants to retain or retire specific applications. In this section, we will discuss this framework and primarily focus on how to plan for migration. Cloud migration does not always result in meeting business outcomes for an enterprise. In this case, the enterprise must evaluate which applications deliver these outcomes and what must be done with the applications that do not. The 5 R's framework helps with the decision-making process.
Rehost: This is the "lift-and-shift" strategy discussed earlier in this chapter; that is, you are moving an application as is, without modifications to the application itself, but you may make some modifications to the compute infrastructure on which it runs. For example, an application running on a physical server in your data center will migrate to an AWS EC2 virtual machine. So long as the application does not have a dependency on the underlying hardware, it should be relatively quick and easy to migrate the application. There is no guarantee that rehosting an application will help it benefit from using the cloud. For example, your cloud costs to run this application may be higher than running it in the data center, or you may notice performance degradation when it runs in a virtualized environment. This strategy is generally good for legacy applications that you don't have complete knowledge of and control over.
Revise: In this strategy, you make a few changes to an application to take advantage of cloud capabilities such as elasticity and the numerous IaaS and PaaS services offered by CSPs. For example, you can modify an application using a MySQL cluster to use AWS RDS, thus reducing the cost and complexity of operating a MySQL cluster in the cloud and using a database service that is highly scalable and reliable. There are several other services on which an application may depend, and carefully evaluating alternative services in the cloud and switching to these services enables the application to benefit from the cloud. Some changes may require modifications to the code, and you will have to make sure that you can modify the code and meet runtime requirements to replace it with cloud services.
Rearchitect: In this strategy, you make several modifications to optimize the application to run in the cloud, utilizing as many cloud-native capabilities as possible without completely rewriting it. For example, in this strategy, you could consider containerizing the application, utilizing object storage, adding horizontal scalability options, or utilizing PaaS services such as databases. This strategy will take a significant commitment from engineering and operations teams. All the changes will need to be tested and validated in addition to making modifications to culture, technology, and processes within the enterprise.
Rebuild: In this strategy, you will rebuild the entire application from scratch, incorporating cloud-native capabilities. This strategy is almost equivalent to a greenfield (i.e., in the initial stages of design or development; see Chapter 9) application development, except that you may choose to use some of the existing code from the application. This strategy enables engineering teams to design for the cloud utilizing new architecture paradigms, simplify the application architecture, and address technical debt. This strategy will take the most amount of time and effort and add to the migration cost. Consider this option for mission-critical applications with management backing and support.
Replace: In this strategy, you replace existing homegrown or third-party applications running on-prem in your data center with their SaaS equivalent. Alternatively, you can consider replacing the application with a new third-party solution that is readily available from the CSP's cloud marketplace.
Replacing an application is one of the fastest ways to reduce the number of legacy applications. However, it could involve contractual obligations or data migration constraints that prevent you from replacing it immediately. In this case, you take a phased approach, adopting the newer software incrementally until the existing application can be fully replaced.
Retain: There will always be applications that cannot be migrated to the cloud for various reasons, including security and compliance, the loss of access to their original code or runtime environment, or highly optimized workloads that are best run out of a data center. In this situation, you need a strategy to retain them and continue running them on-prem.
Retire: Cloud migration offers an opportunity to retire applications that are no longer required or actively being used. The enterprise must create and maintain a catalog of all applications and identify the ones in this category. These applications can be retired.
• Plan the migration of applications and data: In this step, we identify all the up-front work that needs to be done to support the migration effort and identify all the stakeholders involved with the migration. In a large enterprise, several stakeholders need to be included in the planning and cloud migration process. This includes the following teams:
Cloud strategy team: The CCoE should drive enterprise-wide planning, and each of the functions should own the responsibility for migrating its own applications and data. Cloud operations should be responsible for setting up and managing the centralized capabilities, such as setting up network connectivity between the enterprise and the CSP and setting up the initial organization structure in the AWS Organization to create the various accounts and hierarchy, along with SSO and basic account management activities.
Architects: Typically, IT and business units have architects who are responsible for identifying application dependencies and can look at the strategies and processes for a specific application. Their insights are valuable in identifying and addressing requirements before the migration process starts.
Security and compliance: Specialized teams are required to identify and address security- and compliance-specific risks associated with migrating to the cloud.
Data classification: The data classification team's role is to identify all the data that would get migrated to the cloud and make the necessary adjustments to the classification process.
Finance: The finance teams contribute to the calculation of TCO and return on investment (ROI) and are primarily responsible for evaluating the migration process to ensure that the business outcomes related to cost are being met.
Operations and application: The operations and application teams from all the functions will drive the actual migration process. They will be involved with identifying application-specific dependencies and risks, functional priorities, and requirements.
• Migration: In this step, you migrate the applications and data. Always start with experimenting with migrating a few simple applications and learning from the experience. Each function responsible for migration should identify low-risk applications and initiate the migration with a target end state in mind. This helps identify patterns for migration that can be standardized and utilized for migrating the rest of the applications.
The migration process can be divided into six distinct steps, illustrated in Figure 4.3.
1. Discover: This involves the detailed exploration of each application's requirements. Discovery must focus on both business requirements and technical requirements. The business requirements discovery would include SLA and service-level objective requirements, cost requirements, the future state road map, operational processes, and runbooks. The technical requirements would cover areas such as performance, network connectivity, storage, security, and other requirements.
2. Design: Based on the detailed discovery, this phase involves designing the target state architecture for the application and the supporting operational components and processes to address the business and technical requirements that are identified. The design phase can be quick and easy for rehosting and extremely complex for rebuilding an application to meet the migration strategy.
3. Build: In this phase, the migration team will review the design document and create a plan and schedule for the migration of the application. If the application must be redesigned or rebuilt, additional teams will be engaged to start work on the applications.
Figure 4.3 Migration process: (1) Discover, (2) Design, (3) Build, (4) Integrate, (5) Validate, (6) Cutover.
Once the application is redesigned or rebuilt, it will get deployed in the cloud.
4. Integrate: The build phase manages only the deployment of the application in the cloud. This phase is responsible for making all the external connections and addressing dependencies such as API calls, database access, enabling authentication and security, and other dependencies such as DR. Finally, the application is brought online to make sure that it is able to run successfully.
5. Validate: This is the quality assessment phase, in which the application will be validated (functional, performance, DR, and business continuity tests) before declaring it to be ready. Once the validation is complete, all the documentation and processes will be updated to reflect the application changes and runbooks for the support teams to start supporting the application.
6. Cutover: This is the final step, and depending on the application, it could involve shutting down the on-prem environment and switching all users to the migrated environment in the cloud, including final data migration, Domain Name System (DNS) changes, security and compliance verification, and final validation.
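Some enterprises encode their migration-strategy criteria so they can be applied consistently across an application catalog before discovery even begins. The following sketch is a purely illustrative rule-of-thumb classifier for the strategies discussed above (the 5 R's plus retain and retire); the attributes and rule order are assumptions that each enterprise would tailor in its own migration strategy document.

```python
# Illustrative rule-of-thumb classifier for the migration strategies discussed
# above (the 5 R's plus retain/retire). The criteria and their order are
# assumptions an enterprise would tailor to its own strategy document.
def classify(app: dict) -> str:
    if not app["actively_used"]:
        return "retire"
    if app["regulatory_block"] or app["hardware_dependency"]:
        return "retain"
    if app["saas_alternative"]:
        return "replace"
    if not app["source_available"]:
        return "rehost"          # lift and shift legacy code as is
    if app["mission_critical"] and app["management_backing"]:
        return "rebuild"         # cloud-native rewrite
    if app["needs_scalability"]:
        return "rearchitect"
    return "revise"              # modest changes, e.g., swap MySQL for RDS

app = {
    "actively_used": True, "regulatory_block": False, "hardware_dependency": False,
    "saas_alternative": False, "source_available": True, "mission_critical": False,
    "management_backing": False, "needs_scalability": True,
}
print(classify(app))  # -> rearchitect
```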
Challenges with Cloud Migration
Successfully migrating to the cloud requires a clear cloud strategy, a CCoE that can govern, and functional teams with the right skills and knowledge. Understanding these needs and having a plan to meet them quickly will give the enterprise a considerable advantage. Next, we've listed common cloud migration challenges that businesses face, and how you can overcome them:
• Managing costs: The cloud can deliver cost savings, but understanding and calculating cost savings can be challenging. During the discovery and design phases, cost estimates of the current on-prem solution should be prepared and compared with cost models in the cloud. By making this comparison, you can keep costs in mind and optimize the architecture to meet the business requirements as you formulate the design.
• Skills: The enterprise has to invest in extensive training of its personnel involved with the cloud. Training and practical experience will reduce or minimize knowledge gaps related to cloud migrations.
• Complexity: The cloud is generally easier to manage because enterprises don't have to manage physical compute infrastructure or worry about capacity planning. However, without proper governance, cloud migration can quickly get out of control, with hundreds of accounts and thousands of resources getting provisioned. Enterprises can lose visibility of where their applications and data are located. For instance, it is very easy for someone to accidentally create an AWS S3 bucket in Europe and store US data there, or vice versa.
• Multicloud: Dealing with a single cloud is already complex. However, if the enterprise has decided to use multiple clouds, an even more elaborate discovery, design, build, integrate, validate, and cutover cycle is required to migrate applications.
• Dependencies: It is very rare to have an enterprise application that is stand-alone (i.e., it does not have dependencies on other services within the enterprise). Hence, the discovery phase will need to account for all dependencies and make sure that the dependent applications or services are migrated in parallel.
• Legacy applications: Some legacy applications are extremely challenging to migrate, even with a rehost strategy. Hence, it is important to decide whether it is worth the effort and frustration of migrating such applications. As alternatives, you can rebuild or replace the applications.
• Security: Understanding cloud security before you start migrating is crucial, as this will make sure that each application that gets
migrated is secured before the cutover step. You want the same level of security or higher when you migrate to the cloud.
• Data: Enterprises have large amounts of data that they have stored over the years, and they continue to add data on a daily basis. Migrating large amounts of data from the premises to the cloud can take days, weeks, or months. Making sure that you don't lose data during cloud migrations makes the process extremely challenging.
• Stakeholder support: Cloud migrations can go off track or get delayed for several reasons. Enterprises need to continually keep stakeholders informed and get their support to continue to progress.
Cloud Security
Cloud security is the protection of enterprise data, applications, and computing resources in the cloud. The goal is largely the same for the cloud as for data centers, but the methodologies, policies, standards, and frameworks used to protect assets (data, apps, and resources) in the cloud are different from those in the traditional models. There are fundamental differences in the way that traditional data center environments are protected versus cloud environments. The traditional model protected assets by creating a virtual perimeter around them. For example, data centers utilized firewalls to control all ingress and egress traffic. Employees accessed assets via secure local area networking/wide area networking (LAN/WAN) connections or used virtual private networks (VPNs) to connect to remote locations. The security team had complete visibility of and control over all traffic and employed specific tools and processes to protect the enterprise.
In the cloud model (IaaS, PaaS, SaaS), the enterprise no longer controls the data center, network, physical servers, or services. In the case of SaaS, they do not control the application stack either, and they have very limited visibility of the security controls on the platform. The cloud is entirely software-driven and can be provisioned, configured, and managed programmatically. Cloud security has to deal with the enterprise data and applications and manage cloud environments and services that the CSPs provide. The CSP uses a shared-security model that splits responsibility for security between the CSP and the enterprise.
In the traditional model, all changes to the data center, including firmware upgrades to hardware, are carefully managed and controlled by the security team using change management processes. Enterprises work closely with
hardware and software suppliers to detect and fix security vulnerabilities periodically. In the cloud, CSPs and enterprises use CI/CD approaches to continually add new features to their applications and services, thus exposing them to new threats and vulnerabilities.
Public clouds are designed for multitenancy, which means that the compute infrastructure is shared across multiple customers. In a multitenancy system, all compute resources are shared with numerous tenants, and any malicious tenant can exploit a vulnerability and access other tenants' data. For example, in 2018, the Spectre and Meltdown vulnerabilities allowed unauthorized access to memory, potentially enabling tenants to read the VM instance memory of other tenants. This could also happen in the data center model; however, the risk is much lower, as enterprises do not share their hardware with random tenants and are in total control and can address such a vulnerability quickly. In contrast, they have to depend on the CSP to address the vulnerability in the cloud.
DevSecOps and Cloud Security
While DevOps practices have helped streamline software development by creating CI/CD pipelines and automating applications' deployment, security was still an afterthought. Security teams were generally outside the software development function and focused on identifying vulnerabilities when the application was ready for deployment. This slowed deployments, undoing the productivity that DevOps practices introduced. This resulted in integrating security practices within DevOps to create the DevSecOps practice. In the DevSecOps practice, the DevOps team shares responsibility for the security of the application and integrates the detection of vulnerabilities via the CI/CD pipeline. Thus, when the application gets built, you also can see potential vulnerabilities. Enterprises can add a cybersecurity specialist to each of the teams or invest in training their software development and operations teams. As enterprises evolve to create DevSecOps practices, shortages of relevant training and talent continue to be problems in the industry. CSPs have started addressing this gap with training and documentation, such as AWS's Security Pillar in its Well-Architected Framework. DevSecOps enables the enterprise to build and deploy more secure application stacks, but it still does not address all cloud security–related challenges. When using the cloud, application security is one of several areas that need to be addressed.
Data security. Data is one of the critical assets that an enterprise owns, and it is often sought after by threat actors. Applications can create and manage
data or consume data produced by another source. With DevSecOps, the focus is to protect the data created and managed by the application, but data outside the application's control also needs to be addressed and protected. Typically, enterprises use data classification levels (i.e., low, medium, high) to control data access. They may be forced to classify data in certain ways due to the regulatory compliance requirements from industry or governments. For example, Europe has several compliance standards, including GDPR in the European Union and G-Cloud in the United Kingdom. The United States has several regulatory requirements covering data classification, such as the Federal Risk and Authorization Management Program (FedRAMP), International Traffic in Arms Regulation (ITAR), and the Health Insurance Portability and Accountability Act (HIPAA).
In the cloud, data gets stored in many forms and many services and locations. For example, data can be stored in S3 buckets, databases, CloudFormation templates, or files on EBS volumes across multiple AZs and regions. In all these cases, data must be protected while in transit and at rest. Several techniques can be used to protect data, and CSPs offer key capabilities that enterprises can leverage to secure their data. Data can be protected by tokenization or by using encryption technology. In all cases, given the cloud's multitenant nature, data must be protected at all times. Data can be in three states:
• In motion: Typically, this happens when a browser is sending data to an application over the network or an application is sending data to another application.
• In use: When data is actively being processed, it gets stored in the VM's random access memory or a temporary storage location.
• At rest: When data is not being processed, it is stored on persistent storage devices (a short sketch after this list illustrates one way to protect data at rest).
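As one illustration of protecting data at rest, the following minimal Python sketch uses AWS KMS envelope encryption: a data key is generated by KMS, the data is encrypted locally, and only the encrypted copy of the key is stored with the ciphertext. The key alias is a placeholder, and the local AES-GCM step assumes the third-party cryptography package is available.

```python
# Minimal envelope-encryption sketch for data at rest: generate a data key
# with AWS KMS, encrypt locally, and keep only the encrypted key alongside
# the ciphertext. The key alias is a placeholder; requires the "cryptography"
# package for the local AES-GCM step.
import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms", region_name="us-east-1")

key = kms.generate_data_key(KeyId="alias/app-data", KeySpec="AES_256")
plaintext_key, encrypted_key = key["Plaintext"], key["CiphertextBlob"]

nonce = os.urandom(12)
ciphertext = AESGCM(plaintext_key).encrypt(nonce, b"sensitive record", None)

# Persist ciphertext, nonce, and encrypted_key; discard plaintext_key.
# To decrypt later: kms.decrypt(CiphertextBlob=encrypted_key), then reverse AES-GCM.
```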
In all three states, data must be encrypted or tokenized to prevent unauthorized access. CSPs make it easy to store data and manage encryption by offering key management and other services that can help identify sensitive data using machine learning. Data encryption and tokenization can protect against a wide variety of attacks, including threat actors getting access to a physical medium, storage systems, the hypervisor, the operating system, or applications.
Compute infrastructure security. The cloud offers standard compute, storage, and network infrastructure resources that are protected by the CSP, but the operating system and the layers above it are the enterprise's responsibility. Each DevSecOps team across the enterprise can secure
the operating system. However, enterprises must have consistency in how operating systems and applications are secured across the enterprise. For example, the enterprise may choose to provide a hardened operating system image to be used enterprise-wide in the cloud environment. This image should be maintained by a central team and shared across the enterprise. Cloud networking is another area that can be secured by DevSecOps teams based on enterprise-wide best practices that are created and maintained (as code) by a central security team. In addition, connectivity between the enterprise and the CSP and the setting up of virtual private clouds (VPCs), subnets, DNS, and routing are areas that need to be handled centrally.
Identity and access management (IAM). CSPs offer enterprises the option of either using their proprietary authentication and authorization mechanisms or integrating with their enterprise SSOs. Enterprises have realized the benefit of being able to authenticate their employees in the cloud using their existing SSO solutions. This provides greater control over security and reduces the complexity associated with managing different accounts for each employee. In addition to SSO use for employees, enterprises may need an identity mechanism to authenticate the end users (or customers) who use their products and services. In this situation, it is easier and more manageable for the enterprise to use a central identity system that can be shared across multiple products and services to authenticate customers. Larger enterprises offer numerous products and services, and customers often subscribe to several of them. Having an SSO solution that lets customers authenticate once with the enterprise drastically reduces the complexity of managing multiple accounts and access controls for the same user across numerous products and services.
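As an illustration of how a central team might distribute such a hardened image, the following minimal Python/boto3 sketch shares an AMI with other accounts in the enterprise; the image ID and account IDs are placeholders.

```python
# Minimal sketch: a central team sharing a hardened OS image (AMI) with other
# accounts in the enterprise. The image ID and account IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

member_accounts = ["111122223333", "444455556666"]  # hypothetical account IDs

ec2.modify_image_attribute(
    ImageId="ami-0123456789abcdef0",  # hypothetical hardened image
    LaunchPermission={"Add": [{"UserId": acct} for acct in member_accounts]},
)
```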
Challenges with Cloud Security
In the cloud, CSPs have a shared security model, and hence enterprises must have a security strategy and adequate controls in place to continuously monitor and address threats. The Cloud Security Alliance (CSA), a nonprofit cloud security organization, has noticed a drop in the threats against infrastructure-related components under CSPs' responsibility. This does not mean that these threats are completely addressed; it just means that the attackers are focusing higher up the technology stack, which is a concern to businesses. CSA has identified a number of major cloud security threats.
Data breaches. In data breaches, the attacker's primary objective is to steal data from the enterprise. CSPs will not take responsibility for data breaches if the violation is directly related to the enterprise's own lack of security
controls. AWS offers several storage options, such as Elastic Block Store, Elastic File System, FSx, S3, and databases. With so many options, security teams have to understand how each one works and how to implement security controls and create policies to protect their data. AWS provides several mechanisms such as encryption, versioning, and access control to protect data both at rest and in transit, but it is the responsibility of the enterprise to enable and use these capabilities. Often, due to a lack of policies, best practices, or sufficient knowledge about S3 services, application developers fail to utilize these mechanisms; as a result, data can get stolen.
In the traditional model, data was protected behind an enterprise firewall, and data could reside in one of the locations where the enterprise had a data center. However, in the cloud, enterprises have access to several virtual data centers (i.e., each of the locations of the CSP has its own presence and no central firewall protects all the services). Each service has its own nuances and must be protected using service-specific mechanisms. Let's briefly look at what it takes to protect data in the AWS S3 service. AWS best practices for protecting S3 fall into two categories:
• Preventative security best practices
• Monitoring and auditing best practices
AWS recommends the following preventative security best practices to protect data stored in an S3 bucket (a short sketch after the list shows a few of these applied programmatically):
• Use the correct S3 bucket policies to prevent public access: S3 buckets can be accessed by a specific application or made publicly accessible. For example, you could allow downloadable content for your website from an S3 bucket. If the enterprise does not intend to make data publicly available, it has to create a policy to restrict such access.
• Implement least privilege access: Based on the enterprise's data classification standards, the S3 service must be configured to allow access based on least privileges.
• Use IAM roles for applications and AWS services that require S3 access: Data could be accessed by users or applications. In both cases, S3 configuration should allow access to users and applications, depending on the role. For example, applications may require create, read, update, and delete access, whereas users may need only read access.
buckets can be accessed by a specific application or made publicly accessible. For example, you could allow downloadable content for your website from an S3 bucket. If the enterprise does not intend to make data publicly available, it has to create a policy to restrict such access. Implement least privilege access: Based on the enterprise’s data classification standards, S3 service must be configured to allow access based on least privileges. Use IAM roles for applications and AWS services that require S3 access: Data could be accessed by users or applications. In both cases, S3 configuration should allow access to users and applications, depending on the role. For example, applications may require create, read, update, and delete access, whereas users may need only read access. Enable multifactor authentication (MFA) delete: This mainly applies to the deletion of a bucket. Without additional protections such as MFA, the enterprise risks the accidental deletion of an entire bucket.
• Consider encryption of data at rest: The S3 service allows server-side and client-side encryption with multiple options to store encryption keys, but the enterprise must decide if it wants to encrypt the data.
• Enforce encryption of data in transit: Access to S3 can be via HTTP or Hypertext Transfer Protocol Secure (HTTPS). The enterprise policy should mandate that all bucket access must be via HTTPS only.
• Consider S3 Object Lock: S3 Object Lock is a mechanism that enables enterprises to store objects using a Write Once Read Many (WORM) model, which offers protection against accidental deletion.
• Enable versioning: Similar to version control systems such as Git, S3 enables storing changes made to the objects. You can use versioning to preserve, retrieve, and restore previous copies of an S3 object. With versioning, you can quickly recover from both unintended user actions and application failures.
• Consider S3 cross-region replication: S3 already provides a high degree of durability, but specific compliance requirements may require cross-region replication, which offers an additional layer of protection, as you have more copies of the data.
• Consider VPC end points for S3 access: As pointed out earlier, S3 buckets can be publicly available or restricted via policies that allow only internal access. You can use VPC end points to manage internal access.
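Several of these preventative controls can be applied programmatically. The following is a minimal sketch using boto3, the AWS SDK for Python; the bucket name is hypothetical, and a production script would add error handling and organization-specific key management.

```python
import json
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")
bucket = "example-data-bucket"  # hypothetical bucket name

# Block all forms of public access at the bucket level.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Turn on default server-side encryption for data at rest.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Enable versioning so deleted or overwritten objects can be recovered.
s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration={"Status": "Enabled"})

# Attach a policy that denies any request not made over HTTPS (encryption in transit).
https_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(https_only_policy))
```

Scripting these controls, rather than applying them by hand in the console, is what makes them repeatable across the many accounts and buckets a large enterprise ends up with.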
AWS recommends the following best practices to monitor and audit data stored in S3 buckets:
• Identify and audit all your S3 buckets: AWS provides tools such as AWS S3 Inventory and Tagging to track your S3 buckets. Enterprises with teams working on multiple applications will find the S3 inventory tool helpful for auditing and reporting on the status of the data. Security teams will need access to audit reports to meet compliance requirements such as GDPR. Tagging data and using S3 inventory to track where data resides can be valuable to the enterprise.
• Implement monitoring using AWS monitoring tools: AWS CloudWatch can be utilized to track and alert on request metrics and HTTP error codes (e.g., PutRequests, GetRequests, 4xxErrors, and DeleteRequests). This provides near real-time tracking of activity on the S3 bucket and its objects.
• Enable Amazon S3 server access logging: Server access logging allows for collecting logs related to the requests made to a bucket. Access logs can be a practical way to track publicly accessible buckets, identify patterns of usage (or attacks), and help take preventative measures to safeguard the data.
• Use AWS CloudTrail: AWS CloudTrail enables the tracking of users and their actions on S3. This enables pinpointing and tracking user activity to protect against unintentional access.
• Enable AWS Config: AWS Config monitors resource configurations, allowing you to evaluate them against the desired secure configurations. Using AWS Config, you can review changes in configurations and relationships between AWS resources, investigate detailed resource configuration histories, and determine your overall compliance against the configurations specified in your internal guidelines.
• Consider using Amazon Macie with S3: Amazon Macie uses machine learning to identify sensitive data and can flag data sets that the enterprise has failed to protect adequately.
• Monitor AWS security advisories: Use the Trusted Advisor service to regularly check the security advisories posted for your AWS account. It also identifies publicly accessible S3 buckets.
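As a rough illustration of the inventory and monitoring ideas above, the sketch below (again using boto3) flags buckets that lack default encryption or a public access block and creates a CloudWatch alarm on client-side errors. The bucket name and threshold are hypothetical, and the alarm assumes S3 request metrics have been enabled on the bucket with a metrics filter named EntireBucket.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")

# Inventory every bucket in the account and flag missing baseline controls.
for entry in s3.list_buckets()["Buckets"]:
    name = entry["Name"]
    try:
        s3.get_bucket_encryption(Bucket=name)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            print(f"{name}: no default encryption configured")
    try:
        s3.get_public_access_block(Bucket=name)
    except ClientError as err:
        if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
            print(f"{name}: public access block not configured")

# Alarm on client-side (4xx) errors for one bucket.
cloudwatch.put_metric_alarm(
    AlarmName="s3-4xx-errors-example-data-bucket",
    Namespace="AWS/S3",
    MetricName="4xxErrors",
    Dimensions=[
        {"Name": "BucketName", "Value": "example-data-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
)
```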
The key takeaway is that CSPs provide hundreds of services, each of which has several mechanisms to protect it. CSPs also provide best practices recommendations, but ultimately it is the enterprise’s responsibility to utilize these services or third-party services to protect its data against internal and external threats. CSP-provided services usually have an advantage because of the deep integration with the platform, but if your enterprise uses multiple clouds, it gets very complex to implement best practices across each cloud using each CSP tool. Using a third-party provider who has an offering across multiple clouds would be the best approach in this scenario. Misconfiguration and inadequate change control. The absence of effective change management is a common cause of misconfiguration in a cloud environment. As discussed in the previous section, CSPs offer multiple services with several ways of securing them. This makes it challenging to identify the best option to secure a particular service and implement adequate controls in the cloud. The cloud also enables faster code deployments with infrastructure as code, which means that changes can be rolled out multiple times in a day. This makes it essential to also implement security as code and governance as code. Lack of cloud security architecture and strategy. Cloud security is very mature, but it is rare to see enterprises migrate to the cloud for security reasons.
Enterprises migrating to the cloud generally focus on migration, cost, and architecture-related issues, and security is usually not a priority. Without a security architecture and framework in place, the enterprise usually considers implementing security as an afterthought, which introduces risks that are hard to address once the enterprise has started operating in the cloud. Insufficient identity, credential, and access management. CSPs offer their own IAM services and support enterprise SSO. It is up to the enterprise to decide which option to use, the structures it will use to manage multiple CSP accounts, and the applications that will run in these accounts. Without a clear strategy, a mixed approach that uses both the CSPs' IAM and enterprise SSO will make it challenging to track and monitor users. Account hijacking. In this form of attack, someone aims to take over one of the cloud accounts. All AWS accounts have a root user, who has complete control over the account. Unauthorized access to the root user on an AWS account can be devastating to the enterprise. It gives the attacker access to the applications and data residing in this account, and the attacker can consume an unlimited amount of resources, resulting in monetary loss to the enterprise. Using best practices such as enabling MFA for the root account or locking the root account helps protect against such attacks. Still, security teams must have a thorough understanding of how to protect the account and implement controls to monitor and enforce restrictions on root access. Insider threat. The threat of an insider attack in the cloud is no different from that in traditional environments. Without proper data classification and appropriate access controls in place, data could potentially be exfiltrated from the enterprise. Hence, security must address internal and external threat actors. Insecure interfaces and APIs. One of the advantages of the cloud is that you can manage everything programmatically, which significantly speeds up software development. While infrastructure as code has its benefits, it also comes with the risk of exposing these interfaces and APIs without sufficient access control. In addition, poorly designed APIs could lead to misuse or even a breach of data. Broken, exposed, or hacked APIs have caused some major data breaches. For example, suppose that an enterprise accidentally exposed the S3 read interface to the public. This could result in everyone having access to specific content, and depending on the size of the content, hundreds of thousands of read requests could quickly add to the daily egress bandwidth costs. Limited cloud usage visibility. In traditional models, enterprises had complete visibility of their data center locations, infrastructure in these locations, and the various applications running on them. In the cloud, it is hard to track
who uses what, and services such as the AWS Marketplace allow anyone with access to an account to purchase and deploy third-party applications. Apart from dealing with CSP-provided services, security teams will now have to deal with thousands of ISV products that become available quickly without going through a procurement process or security review. While CSPs have some checks and balances to evaluate products offered in their marketplaces for security (e.g., they scan for vulnerabilities), each such product is another application that security teams must monitor and enforce controls on. Abuse and nefarious use of cloud services. Attackers are not always targeting the cloud for a direct assault. Sometimes they use it to attack other destinations, host malware, or launch phishing attacks. For example, the cloud, with its scalability, offers an excellent platform for launching DDoS attacks. Security teams should monitor for outbound attacks from their cloud environments, not just focus on inbound attacks.
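Returning to the account hijacking risk discussed above, a simple hygiene check on the root user can be scripted. The sketch below uses boto3's IAM account summary; the warning messages are our own illustration, not an AWS-provided control.

```python
import boto3

iam = boto3.client("iam")

# get_account_summary returns account-wide flags, including whether the
# root user has an MFA device and whether root access keys exist.
summary = iam.get_account_summary()["SummaryMap"]

if summary.get("AccountMFAEnabled", 0) != 1:
    print("WARNING: no MFA device is configured for the root user")

if summary.get("AccountAccessKeysPresent", 0) != 0:
    print("WARNING: root access keys exist and should be removed")
```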
Cloud Marketplaces

The cloud makes it easier and faster for enterprises to launch products and services using IaaS and PaaS. Most products and services utilize third-party software (open-source and commercial) and services that aid with functions such as software development, testing, security, compliance, analytics, and monitoring. Using third-party software as building blocks in an application significantly reduces time to market. Typically, within an enterprise, each of the commercial software vendors will have to go through a procurement process that involves sourcing activity, contract negotiation, security reviews, pricing negotiations, vendor onboarding, and setup. In the cloud's on-demand and pay-per-use model, the old procurement model with centralized accounting, budgeting, and long lead times will not work. When product and engineering teams think in terms of hours and days for shipping software, procurement cannot keep thinking in months and quarters. This is where public cloud marketplaces come in. These are online services offered by CSPs that allow their customers to easily find, purchase, deploy, and manage third-party software, data, and services needed by the enterprise. The concept of a marketplace has existed for a long time, with Apple and Google launching marketplaces for their mobile device users. The CSPs followed a similar model when they launched public cloud marketplaces to connect independent software vendors (ISVs) with their customers. Public cloud marketplaces provide multiple options for ISVs to sell and for enterprises to consume. For example, AWS Marketplace allows ISVs to sell applications, SaaS, data, algorithms, and professional services. Applications
can be delivered and consumed as Amazon Machine Images (AMIs) or containers. In both cases, the applications run on the CSP’s IaaS and PaaS and must be designed to meet the CSP’s specifications. The buyers (i.e., enterprises using these applications) run them on the CSP’s IaaS and PaaS platform. ISVs that intend to sell on multiple CSPs must design their applications to run on multiple clouds. Both ISVs and enterprise customers benefit from public cloud marketplaces, as discussed next.
Benefits to ISVs

ISVs can list software and services on a CSP's cloud marketplace. The CSP provides a digital catalog service for listing and processes the sales transaction. The CSP typically charges the ISV a seller's fee and makes additional revenue when the software gets deployed on its IaaS and PaaS platforms. ISVs see the following benefits:
• Access to a new market: Cloud marketplaces provide immediate access to thousands of customers. By default, all the CSP's customers have access to cloud marketplaces and can purchase and deploy listed products within a few minutes.
• Streamlined selling process: The ISV does not have to deal with the traditional sales actions: prospecting, connecting and qualifying, researching, presenting, negotiating, and closing. All of these are handled by the public cloud marketplace. Some cloud marketplaces also provide capabilities for ISVs to sell via their channel partners.
• New business models: Cloud marketplaces offer buyers several options to purchase and use the software. For example, they support pay-as-you-go hourly pricing as well as annual and multiyear subscriptions. Sellers can also sell their software based on various categories (e.g., per user, per host, bandwidth processed, and data stored). ISVs that do not have the ability to sell software as a subscription benefit from the public cloud marketplace, which provides these capabilities, enabling them to offer more options for the consumption of their software and services.
• Global market: Cloud marketplaces are available globally, thus enabling access to international markets that were probably out of reach for some ISVs.
• Buyers: CSP customers can browse, purchase, and consume software directly on the platform without relying on their internal sourcing and procurement processes.
Benefits for Buyers

With these new features, the adoption of public cloud marketplaces among buyers has soared. For example, AWS alone processed over $1 billion in purchases in 2019 through the AWS Marketplace. Of course, that has benefited AWS by driving up the transaction volume on the platform, which is why all the CSPs are now actively promoting their marketplaces and keep adding features. But buyers benefit as well, enjoying the following advantages:
• Free trials: Most ISVs on cloud marketplaces provide free trials that allow an enterprise to subscribe to the service and try it out within minutes, without risk. With free trials, an enterprise can evaluate multiple vendors' solutions, allowing a quick assessment of their functionality, performance, and security before making a purchase.
• Faster purchase and deployment: Cloud marketplaces enable an enterprise to go from selecting a product to deployment within minutes, with only a few clicks.
• Less overhead: The use of hundreds of products in cloud applications has increased the burden on procurement teams. Cloud marketplaces streamline the entire procurement process by standardizing procurement and providing tools to track and monitor usage and spend. In addition, some cloud marketplaces also allow the enterprise to negotiate special pricing directly with the vendors. The procurement team no longer has to follow its typical process, saving a significant amount of time and resources.
• Drawdown on committed spend: An enterprise agreement with a CSP typically involves a spending commitment over multiple years, and in return, the CSP offers significant discounts. The larger the commitment, the larger the discount; however, forecasting your spending pattern over a few years can be risky, since both company strategy and market conditions can change. The enterprise must pay the CSP the agreed-upon amount, regardless of usage. Cloud marketplace purchases count toward the spending commitment and can help meet the commitment numbers.
Cloud marketplaces are still evolving, with CSPs and ISVs getting more creative with their offerings. It is important to note that an ISV is not just selling on the public cloud marketplace; it is also consuming IaaS and PaaS services when it builds and tests applications on the CSP’s platform. For example, if an ISV decides to sell its product on AWS, Azure, and GCP, it will need to design, build, and test the product on all three CSPs. This would require a cohesive multicloud strategy to address deployment. Public cloud marketplaces offer ISVs a new channel and route to market, which will require
different business processes to support sales, marketing, and finance and back-office integrations. Cloud marketplaces also change how sourcing and procurement teams purchase third-party software and services. Whether an enterprise is an ISV selling on the public cloud marketplace or a consumer, there are implications to using cloud marketplaces that should be considered as part of the overall, enterprise-wide cloud strategy.
CHAPTER 5
Cloud Metrics and Monitoring
In this chapter, we will list a few signals and metrics that matter the most in the cloud and some that may not matter. We anticipate that the on-prem and cloud metrics will coexist, as most of the current transformations to the cloud will be in hybrid mode. Note that there is a difference between metrics in the cloud and on-prem. There are monitoring metrics covering various aspects of application availability, application adoption, application usability, performance of various application components, mean time to detect, and mean time to respond, among others. These metrics help to get operational insights and provide key performance indicators (KPIs) to optimize business performance, ensure that there is business continuity, and enable better-informed IT and business decisions.
Metrics

Evolution of Monitoring and Related Metrics

Monitoring and reporting were an early area of concern for CSPs, as their business revolves around usage reporting and financial management, which should be transparent to users. Network monitoring was an area of complexity for industry leaders like Cisco due to the tight security of the devices and the difficulty of capturing the activities taking place in a device without slowing it down. Network monitoring spans from Simple Network Management Protocol–based reporting to complex security scans such as deep packet inspection. While Cisco and others had several of these monitoring facilities, CSPs wanted all of them in one place, thereby making reporting simple and effective.
Cloud versus On-Prem—When, What, and Where to Monitor

Applications hosted both on-prem and in the cloud need a comprehensive monitoring strategy for immediate identification of issues and faster resolution. Your platform should be able to collate the information from the cloud components and on-prem components in real time and correlate it. There is a huge difference between data collected dynamically and data collected after the fact. Also, in the event of a potential failure, the extent to which the details of each event are captured becomes critical to your ability to fix it. There is a substantial cost associated with capturing fine-grained details, and you will have to evaluate the risks as well as the benefits before purchasing a third-party monitoring package. Where to insert the probe, how often the data needs to be collected, and what kind of data needs to be propagated upstream are all very important considerations. Continuous monitoring can be a double-edged sword, as it can significantly affect your application performance and consume resources. Monitoring and measurement need deep domain and business knowledge—they are not for rookies.
Common Monitoring Framework

Next, we will examine each metric in detail and evaluate how each can influence your cloud journey. These definitions and metrics are taken from our experience as cloud practitioners and may not represent every metric that exists. However, the discussion will give you an overall picture of these metrics and help you build up a comprehensive sense of what matters while operating in the cloud. There are two major aspects of cloud monitoring and analysis:
• Application performance monitoring (APM)
• IT operations analytics (ITOA)

These can be both proactive and reactive in nature. Monitoring can be divided into the following categories:

• Infrastructure monitoring
• Network monitoring
• End user monitoring
• Performance and scalability monitoring
• Security monitoring
• Compliance monitoring
• Identity and access monitoring
Specific to the cloud, we have the following additional categories of monitoring:
• Multicloud/hybrid cloud monitoring
• Cost monitoring
• Availability and SLA monitoring
• Reliability
• Response time

There are cloud metrics associated with each of these categories. You can group them into the following categories:

• Time
• Cost
• Quality

Various aspects need to be considered while identifying the process maturity and defining measures for the software and associated life cycles. Proper monitoring and probing of the application starts with the application architecture, application design, application development, and the skills and knowledge of the application development team. Security aspects to be included are writing secure code, scanning code for static vulnerabilities, and resolving the issues found in a timely manner. Automated testing of applications for vulnerabilities and compliance violations, and reporting on them, are critical aspects. Application API authorization should be in place to ensure that users are authorized (coarse-grained and fine-grained) before providing data or performing insert, delete, and update operations on data. The application architecture should have firewall layers to protect the application and database from cyberattacks. The application design should take into consideration user authentication and authorization. All client IDs and secrets should be secured in a vault. The vault is a central place that stores security credentials in a structured and secure fashion so they can be managed centrally. It is protected by several layers of access control and encryption mechanisms. These security criteria can be defined as baselines for applications to meet. The percentage of security baseline criteria that are met must be calculated, and a risk score needs to be calculated for any criteria that are not met. This score needs to be tracked across all applications during development to identify quarter-to-quarter trends and determine a maturity score. Compliance auditing will validate that the security controls are functioning as planned. Completing such audits on a periodic basis will also give guidance about organization-level security readiness. Let us now examine a few
critical metrics for each of the areas that are traditionally monitored and their significance in a new world.
Cloud Service Metrics

Cloud service metrics can be based on time, cost, and quality parameters, and this will be covered in this section. Since security is considered another aspect of concern in the cloud, there are additional measures for security. Finally, we will examine a few business-related measures.
Time-Based Metrics
• Service availability
• Mean time between failures (MTBF) and mean time to repair (MTTR)
• Response time and latency
Service availability: This refers to the guaranteed availability of your cloud and the data associated with it that are offered by the CSP. Usually, it is measured in terms of the percentage of time during which the service is available. As mentioned in Chapter 3, this is usually called five nines (99.999%), which essentially means that your network and systems will not go down for more than five minutes in a whole year. For a CSP, it is very important to maintain this availability, as the basic value of the cloud itself is based on the promise of “always available and always on.” In Chapter 6, we will explain the concept of high availability, which essentially means that your network and services should be available practically 100%. Mean time between failures (MTBF): This measures the average time between two failures. Usually, this is a quantitative measure that is an average over a period of time and is adjusted as more incidents occur (depending upon the reporting period, whether quarterly or yearly). Mean time to repair (MTTR): This refers to how much time it takes to recover from a failure incident. This is a common measure for both on-prem and the cloud, as it indicates the organizational agility and technical and infrastructure bench strength of the organization. The service expectations are usually determined when an SLA is signed between your company and potential and actual customers. SLA definitions track the MTTR very diligently, as it indicates how well you can run your
business in a given environment. They are crucial parts of any vendor or customer contract and usually are vetted by the company's legal team before being shared with customers. Response time and latency: These statistics track the round-trip time for a request to be fulfilled and how many hops it needed to traverse before reaching a given destination. Latency, a common term in networking, simply refers to the delay caused by intermediate systems (routers and switches). In a local data center connected across racks where databases (DBs) and applications are hosted side by side, it is possible to get sub-millisecond response times for your request. However, when you move your workload to the cloud, this will heavily depend on the uplink and downlink bandwidths of your network. With your workload moving to the cloud, you will start experiencing more delay in your response time, and the problem can get worse when your payload is in a different cluster or geographic area. Monitoring and measuring this metric help you plan the location of your servers and your network capacity and bandwidth.
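As a simple illustration of how these time-based metrics relate, the following sketch derives availability, MTTR, and MTBF from a list of outage windows; the outage data is hypothetical.

```python
from datetime import datetime, timedelta

# Outage windows observed over a one-year reporting period (hypothetical data).
outages = [
    (datetime(2021, 3, 2, 10, 0), datetime(2021, 3, 2, 10, 12)),
    (datetime(2021, 7, 19, 1, 30), datetime(2021, 7, 19, 2, 5)),
]
period = timedelta(days=365)

downtime = sum((end - start for start, end in outages), timedelta())
availability = 100 * (1 - downtime / period)

# MTTR: average time to restore service; MTBF: average time between failures.
mttr = downtime / len(outages)
mtbf = (period - downtime) / len(outages)

print(f"Availability: {availability:.4f}%")
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```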
Cost-Based Metrics
• Total cost of ownership (TCO)
• Cost-benefit analysis (CBA)
• Return on investment (ROI)

One of the main reasons for using the cloud is the financial viability of avoiding a capital-intensive data center. As a starting point, we will review these cost-based metrics that are important to an organization's chief financial officer (CFO) and chief information officer (CIO), not to mention the chief executive officer (CEO). Then we will examine some of the ongoing metrics like resource utilization, compute and storage costs, and capacity waste with cloud management alerts. Total cost of ownership (TCO): This is an interesting planning parameter, as it includes both tangible and intangible aspects of the planning. If you look at traditional on-prem data centers, the major tangible parameters include building, utility, infrastructure (i.e., racks, cabling, and power backup), and equipment costs. In addition, there are ongoing costs incurred in operational overhead, ongoing maintenance, licenses, upgrades, renewals, and compliance. Intangibles include provisioning, the carbon footprint, intellectual property, data and security control, the costs of failure, troubleshooting, and branding.
Cost-benefit analysis (CBA): This is a methodical and systematic approach of considering an aspect of the proposal and estimating its cost, as well as the costs of any alternatives. Here, we establish the baseline of investment need for the cloud transformation and compare it with what would happen if we did not move to the cloud. Once we establish the baseline, we should consider the short- and long-term strategic and tactical outcomes, thereby allowing the planners and finance department to take calculated and informed risk. Return on investment (ROI): The basic rule of ROI is the formula ([Benefit − Cost]/Cost), expressed as a percentage. ROI is a common measurement of anything we do in the financial world, and it allows planners to predict the future return possibilities of an investment being made now. For a decision maker to move the existing data center to the cloud, it is important to know the implications of time and money being invested to achieve the potential gain. For medium- to small-scale firms, the decision making may be easier, as they do not have the capacity to build their own data centers. However, it is recommended that they figure out the ROI calculations to assess the impact and outflow of a decision. Even for such a scenario, the overall investment planning may be surprising, considering some of the associated bottlenecks of security, data policy, and vendor lock-in. A small start-up may not have the option of adopting a multicloud or hybrid cloud, and that makes thinking about ROI very important.
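A minimal sketch of the ROI formula above, using hypothetical three-year numbers purely for illustration:

```python
def roi_percent(benefit: float, cost: float) -> float:
    """ROI as a percentage, using (Benefit - Cost) / Cost."""
    return (benefit - cost) / cost * 100

# Hypothetical three-year comparison: staying on-prem vs. migrating to the cloud.
onprem_cost = 2_400_000       # data center, hardware refresh, staffing
cloud_cost = 1_650_000        # migration effort plus three years of cloud spend
expected_benefit = 2_100_000  # avoided CapEx, faster releases, reclaimed staff time

print(f"Cloud ROI: {roi_percent(expected_benefit, cloud_cost):.1f}%")
print(f"Spend avoided vs. on-prem baseline: {onprem_cost - cloud_cost:,}")
```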
Quality-Based Metrics
• Cost of quality (COQ)
• Cloud scalability
• Cloud elasticity
• Cloud efficiency
• SLA violations

Cost of quality (COQ). According to the American Society for Quality (ASQ), this term is defined as a methodology that allows an organization to determine the extent to which its resources are used for activities that ensure high quality, appraise the quality of the organization's products or services, and result from internal and external failures. Having such information allows an organization to determine the potential savings that can be had by implementing process improvements. Cloud scalability. One of the main attractions of the cloud is the ability to scale and perform based on the dynamic demand. Cloud scalability is the
ability of a given cloud to expand its resources and services to meet the increased demand of an application for a given period. These fluctuations can be time bound or seasonal, but the SLA between the CSP and the host organization should measure the scalability aspect of provisioning the services. Even though the CSP can offer infinite elasticity in theory, that may not actually be true, as every cloud data center also may have limitations that you may not know about. For instance, a special event such as a Black Friday sale can affect the load factor on a website. Estimating the peak scalability requirements based on empirical data while planning your cloud hosting will be important. Cloud elasticity. This is the ability to dynamically adjust the resource requirements at any given time. It may sound similar to scalability, but there is a crucial difference. Cloud scalability requires the CSP to provision your additional capacity for a given specification at a given location for a longer period and with a dedicated SLA, whereas elasticity is simply the ramp-up and ramp-down of necessary resources to meet the needs of the current situation. Consider a situation in which the CPU load goes above a threshold value of 80%; the cloud may add CPU capacity until the load comes back under control, and that is elasticity. Scalability, by contrast, requires additional CPU capacity to be provided for a longer time because the application has grown since it was first deployed. Cloud efficiency. This is the (often measurable) ability to avoid wasting materials, energy, efforts, money, and time when doing something or producing a desired result. In a more general sense, it is the ability to do things well, successfully, and without waste (OpsCruise). As you can imagine, one of the prime marketing arguments for the cloud is its ability to manage resources and its pay-for-use philosophy. Pay-for-use is simply efficient use of resources and a way to avoid waste. In traditional data centers, planners anticipate well in advance and provision much more than the anticipated usage, as provisioning cycles happen once every two years or even less frequently. So, we build additional capacity that is never used fully at any time, and we continue to build more capacity once we reach a certain level (maybe 70%). With the cloud offering elasticity, the need for building extra capacity is nonexistent. Measuring cloud efficiency becomes critical, as it can offer greater value and cost savings over time. SLA violation. SLAs are often considered an important tool to keep ongoing support and maintenance on a par with the initial offering. There are various means to measure the SLAs, including quarterly reports published to make sure that all types of agreements are satisfied. It is a common practice to include a provision for penalties if SLA terms are violated. SLAs can be
in several areas, including cost, availability, speed, and performance. SLA terms are highly dependent on the type of business and the corresponding requirements of the business.
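To make the distinction between elasticity and scalability concrete, the sketch below imitates the threshold-driven ramp-up and ramp-down described above; the thresholds and instance limits are illustrative assumptions, not CSP defaults.

```python
def desired_capacity(current_instances: int, cpu_utilization: float,
                     scale_up_at: float = 80.0, scale_down_at: float = 30.0,
                     min_instances: int = 2, max_instances: int = 20) -> int:
    """Return the instance count an elastic policy would target for the
    observed average CPU utilization (percent)."""
    if cpu_utilization > scale_up_at:
        return min(current_instances + 1, max_instances)
    if cpu_utilization < scale_down_at:
        return max(current_instances - 1, min_instances)
    return current_instances

# Ramp up under load, ramp down once the spike passes.
print(desired_capacity(4, 92.0))  # -> 5
print(desired_capacity(5, 22.0))  # -> 4
```

Scalability, in contrast, would show up as a permanently higher baseline: the minimum instance count itself grows as the application grows.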
Cloud Security and Security Metrics

Security is one of the primary concerns for organizations that want to move to the cloud, with many related issues. Consider the following initial metrics for cloud security:
• Severity of the security vulnerability
• The risk and impact of any security breach
• The available time to resolve the issue (if applicable)

These metrics can be used to measure both the proactive and the reactive aspects of cloud security.

Severity. Each security vulnerability is identified by a common vulnerabilities and exposures (CVE) identifier. A CVE is a unique, common identifier for publicly known information-security vulnerabilities in publicly released software packages. Every vulnerability will have a severity score, assigned using the Common Vulnerability Scoring System (CVSS), a free and open industry standard for assessing computer system security vulnerabilities. The CVSS score (ranging from 0 to 10, with 10 being the highest) is calculated based on a formula that takes into consideration the ease of exploitation of the cloud and associated assets in the cloud and the impact of such a breach.

Risk. Risk is a score based on the business priority of the application, the classification of the data the system is handling, the overall impact of a security breach on revenue and the company's reputation, and the probability of exploitation of the vulnerability. The organization should assign weights to each category as it comes up with the risk score.

Time for closure. Every vulnerability identified in an application should be assigned a deadline to resolve it. This is required to track whether it is closed on time. This will also help companies to accept the risk for a specified time and track the progress of problem resolution against a timeline. The time for closure needs to be decided by the security team based on the severity score and risk score. If the risk is low and the severity is low, the time for closure can be longer. The longer a security vulnerability exists, the greater the chance that the data and brand will be compromised.
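A minimal sketch of how severity, risk, and time for closure might be combined; the weights and closure bands are hypothetical and should be replaced by your organization's own policy.

```python
# Hypothetical weights; each organization chooses its own.
WEIGHTS = {"cvss": 0.4, "business_priority": 0.3, "data_classification": 0.3}

def risk_score(cvss: float, business_priority: float, data_classification: float) -> float:
    """Combine a CVSS base score (0-10) with business priority and data
    classification ratings (also normalized to 0-10) into a weighted score."""
    return (WEIGHTS["cvss"] * cvss
            + WEIGHTS["business_priority"] * business_priority
            + WEIGHTS["data_classification"] * data_classification)

def closure_deadline_days(score: float) -> int:
    """Map the risk score to a time-for-closure target (illustrative bands)."""
    if score >= 8:
        return 7
    if score >= 5:
        return 30
    return 90

score = risk_score(cvss=9.1, business_priority=8, data_classification=7)
print(score, closure_deadline_days(score))  # 8.14 -> 7-day closure target
```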
Finance Metrics

Cloud revenue has grown many times over during the last few years, and so has the waste of resources provisioned in the cloud. A recent study (Chapel, 2019) showed that more than $14 billion is lost to cloud waste and underutilization. This section will examine a few of the critical parameters that you should be aware of as a cloud planner so that you can bring financial discipline into your organization. Capacity versus utilization. This is an important aspect of cloud planning and efficiency. Efficiency is defined as being directly associated with reducing waste. There are very attractive pricing models such as pay-as-you-go, which allows you to incrementally enable your services and pay only for what you use. Closely monitor what you provisioned against how you actually used the capacity in order to decide the optimal investment. All CSPs have this information readily available for you, so make this check a part of your quarterly business review. Discounts and rebates. These tools can be very tricky to evaluate unless you really know what you are going to do with them. In our view, discounts and rebates are negative marketing tactics and should be handled with care. If you are a beginner and want to experiment with the cloud and learn to use it, then it is worthwhile considering them. There are cloud credits offered by various providers to allow people to learn freely, and indirectly it helps to create word-of-mouth marketing. So, this is beneficial for both parties. Monitoring. This allows you to measure what you use precisely and compare it against what is committed. CSPs sometimes offer certain types of monitoring parameters for free, but more often, they offer monitoring as an additional service, and the cost of this can create additional overhead. The CSPs charge for monitoring, and it can create additional traffic and data consumption for your assets. It can become a double whammy unless you plan your monitoring and measurement processes well in advance. Use your best resources to design them, and only monitor what matters. SLA violations. These can be a real concern from both a financial and legal standpoint. Poor planning and unskilled personnel can make such errors, which make the company look bad. In our experience, we have seen penalties come up during a discussion when the client wanted to switch to a new provider. There may be hidden terms that can really haunt you. Read your SLA carefully and make sure that your legal and finance departments fully vet it before you sign it.
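As a quick illustration of the capacity versus utilization check described above, the following sketch computes a utilization rate and the cost of idle capacity from hypothetical monthly figures.

```python
# Hypothetical monthly figures pulled from a CSP usage report.
provisioned_vcpu_hours = 120_000
consumed_vcpu_hours = 71_500
blended_rate_per_vcpu_hour = 0.045  # USD, illustrative

utilization = consumed_vcpu_hours / provisioned_vcpu_hours
waste_hours = provisioned_vcpu_hours - consumed_vcpu_hours
waste_cost = waste_hours * blended_rate_per_vcpu_hour

print(f"Utilization: {utilization:.1%}")                     # ~59.6%
print(f"Idle capacity cost this month: ${waste_cost:,.2f}")  # ~$2,182.50
```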
Data retrieval and archival. These policies are really an afterthought for many planners, as the issues are not so obvious at the time you design your infrastructure. However, this is one of the worst money drainers in the long term. Data stored anywhere (whether online or offline) is going to cost money, as are the protection, archival, retrieval, and migration of data. CSPs make their profits from the data they hold. Having a proper strategy for data storage, backup, archiving, and purging is very important for financial planning. Average revenue per client. This statistic helps to assess the soundness of a CSP and allows you to plan your investment thoughtfully. If the CSP is losing money, no matter how good its reported stock market position, think twice about putting your assets with it. Given the current growth in the cloud, the top five players in the industry may not have issues, but it could be a concern if you are considering using new and smaller competitors.
Commonly Used Business Valuation Metrics

Business valuation metrics allow you to assess the overall effectiveness of using the cloud as your platform. By looking at these metrics, you can assess the overall customer experience and how likely it is that they will continue to use your services and potentially renew or even upgrade them. The following aspects are worth considering. Operational cost. This is the cost associated with maintaining the business day to day. It also includes IT allocation as a major portion. The IT budget is spent on hosting services for the business, and thereby enabling it to increase momentum and output. As a planner, it is important to consider the implications of investing in the cloud as opposed to keeping assets on the premises. The costs of service, acquisition, and operation provide key inputs and comparative analysis of one versus the other. Monthly recurring revenue and nonrecurring revenue indicate the health index of a specific CSP. They reflect the ability of the CSP to convert a potential short-term, trial customer to a repeat user and grow the recurring revenue on a positive trajectory. Customer lifetime value (CLV). The CLV is a specific metric that a CSP uses to measure its customer conversion rate from a specific marketing campaign. There are formulas and methods to calculate the CLV, and it is interesting to watch how this metric trends for the CSP. As you plan your enterprise, it can give you a good idea of whether you should take advantage of a specific offer from the provider. There are also terms like churn rate and retention rate used in conjunction with CLV, which are useful while evaluating a CSP.
Churn rate. The number of customers who leave a business in a given period of time is known as the churn rate. It is a percentage value and is inversely related to the retention rate (discussed next). Churn is an important aspect for a CSP, as the competition is getting tougher and retaining service levels and SLAs is critical. Retention rate. The number of customers who stay with a business is the retention rate. It is better to keep an existing customer than to lose one and replace it with another. Any new customer has a loaded cost and cost of sale associated with it. The amount of recurring business and product sales is a very important indicator of customer loyalty and commitment to an enterprise's products and services. Net promoter score (NPS). This comprises the overall sentiments of customers toward your company. NPS is a good indicator for keeping tabs on the pre-cloud and post-cloud migration impact on your customer base. As a CIO, use specific questions in the NPS survey to assess your customer experience and sentiments.
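A small sketch of how churn, retention, and a simplified CLV estimate can be computed; the figures and the CLV approximation (average monthly revenue divided by monthly churn) are illustrative only.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    return customers_lost / customers_at_start

def retention_rate(customers_at_start: int, customers_lost: int) -> float:
    return 1 - churn_rate(customers_at_start, customers_lost)

def simple_clv(avg_monthly_revenue: float, monthly_churn: float) -> float:
    """One common simplified CLV approximation."""
    return avg_monthly_revenue / monthly_churn

start, lost = 2_000, 90  # quarterly figures, hypothetical
print(f"Churn: {churn_rate(start, lost):.1%}")             # 4.5%
print(f"Retention: {retention_rate(start, lost):.1%}")     # 95.5%
print(f"CLV estimate: ${simple_clv(120.0, 0.015):,.0f}")   # $8,000
```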
Metrics Relevant to the Database

Data and its associated operations, transactions, and other activities are critical for optimal management of the cloud. For a CSP, success is defined by its ability to offer speed and accuracy of transactions. The world of IT often revolves around the management and processing of data in various forms and capacities. In the following section, we will examine metrics related to data and databases. Availability and uptime. This is a critical concern for an on-prem DB admin, and many times, having five nines is essential to job safety itself. CIOs report on this in their service reviews and invest significant OpEx and CapEx in making sure that this KPI stays within the necessary range. CIOs plan for high availability and invest in DR and backup infrastructure to make sure that uptime and data availability is maintained. Backup and restore. Backup and restore are considered among the primary responsibilities of a database administrator (DBA). Most DBAs are valued for their ability to help a company recover after a disaster. Every quarter, this is reported as a key indicator of an organization's preparedness, stability, and resilience. The cloud offers backup and restore as an automated service in most cases, but it can get very complicated with data in transition and the growing number of compliance requirements in different countries.
Resource and capacity consumption. Compute, storage, and bandwidth constitute another area that is keenly and actively monitored and are among the reported parameters for a traditional hosting environment. Pager service and a dedicated support team with 24/7 on-call personnel must be available to monitor the vital statistics of the environment, and any threshold breach is considered a priority one incident. The number of servers and number of DB nodes are other parameters that are monitored for planning and capacity purposes. The expansion of the size of the DB and nodes is considered a negative indicator, as it indicates a lack of archiving or poor planning of resources and administrative overhead. Fixing such problems is often a tedious exercise, which makes CIOs worry about the number of DBs under them, many of which are not even updated or used for months. Rack space that is created but not used in an on-prem data center is very expensive; real estate is one of the most concerning sunk costs for a company. These metrics are often available by default from CSPs. Tools and mechanisms to analyze slow queries, cluster metrics, and schema metrics are offered by CSPs, so DBAs no longer have to worry about tracking them as part of their day-to-day work; these responsibilities have now shifted to the CSP. The cloud also allows you to configure the threshold for specific conditions of your DB, such as usage, performance, and accounts. Long query performance and responsiveness. We recommend that this should be a key parameter to monitor once you move to the cloud, as it also indicates the network, bandwidth, and location aspects of your CSP and any changes that might occur to the network speed and latency. It is a complex parameter to measure, but it can provide excellent insight into overall CSP responsiveness.
Cloud Monitoring

Fully managing and monitoring your applications and the associated infrastructure are important to protect your business and enhance the user experience. Traditionally, application and infrastructure monitoring were considered complicated and costly due to the wide variety of systems and lack of a common standard. Over the past several years, however, the monitoring spectrum has evolved much faster. Standardization and unification of metrics have made monitoring accessible even to smaller organizations and allowed them to deploy commercially available monitoring solutions. With assets moving to the cloud and more and more
applications hosted out of the public domain, it is important to monitor and measure the outcome and effectiveness of your operations. In the following section, we will examine a few of the best-known monitoring measures. Various CSPs have their own application monitoring tools or mechanisms. For example, AWS has CloudWatch, GCP has Stackdriver, and Azure has Azure Monitor as monitoring tools for their customers. Along with this, monitoring applications like AppDynamics and Splunk can use API interfaces to collect data from any of the cloud connectors and present a comprehensive view of the metrics that are specific to your organizational requirements. The following are the key areas of cloud monitoring, but they also apply to on-prem services:
• Compute and related monitoring
• Response time
• Error rates
• Customer and user experience–related monitoring

Compute and related monitoring. These areas are very specific to basic computing resources like the CPU, memory allocation, disk input/output (I/O) errors, and disk read/write performance. They are the building blocks of any hosting environment and can greatly influence the performance, scalability, and sustainability of your application environment. CPU load and average CPU utilization are two common parameters used to monitor compute resources. Disk I/O, average disk reads, and average disk writes are important parameters to monitor the health of a storage system. Response time. This is a key indicator in the cloud, especially as your assets are outside of the corporate network. Response time is influenced by many parameters, both directly and indirectly. The network bandwidth, hosting zone of your CSP, and underlying capacity provisioned are among the key primary influencers. There can be numerous secondary influencers as well, like shared pool usage and fine-tuning a database to improve its overall performance. This area needs careful consideration and monitoring in order to align with the right performance level. Network I/O, maximum network in, and maximum network out are important monitoring parameters that can help to assess response time. Error rates. This includes the rate of errors associated with application performance and degradation in app usability and performance. Associated troubleshooting and relevant root cause analyses can help to bring stability and long-term sustainability to an app's ecosystem.
Customer and user experience–related monitoring. This is the toughest item of all to monitor, as it requires much more widespread and deep analysis, which often must be measured from intangible outcomes like cloud adoption and overall cloud usage. Apdex and NPS (discussed in the previous section) are some common standards used to measure the user experience and customer perceptions. A few other aspects like request rates, uptime, parallel instances at a time, and garbage collection from failures are used in the cloud domain to assess the health of the system. These focus on active users, spikes in usage, inactivity, and application availability and reliability. They can vary depending on the underlying programming language and infrastructure (e.g., Java, Python, or Node.js). Given the depth and breadth of the cloud monitoring space and the several branches spanning multiple areas, we will focus on infrastructure, application, application performance, end user, and end-to-end monitoring.
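Apdex, mentioned above, has a simple published formula: samples within the target threshold T count as satisfied, samples within 4T count half as tolerating, and slower samples count zero. A minimal sketch with hypothetical response times:

```python
def apdex(response_times_ms, threshold_ms: float = 500.0) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms
                     if threshold_ms < t <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

samples = [120, 340, 510, 800, 1900, 2300, 95, 410]  # milliseconds, hypothetical
print(f"Apdex: {apdex(samples):.2f}")  # 0.69 on a 0-1 scale
```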
Infrastructure Monitoring

Components mapped to an application in terms of system availability, CPU utilization, memory usage, storage availability, and disk utilization provide detailed insight into infrastructure health. The basic building blocks of the cloud are compute, storage, and network. Infrastructure monitoring focuses specifically on these three areas to report and alert administrators if anomalies are detected. For a locally operated data center, infrastructure monitoring is done by one consolidated and coordinated team, as the network, system, and database personnel work together to take care of it. Troubleshooting infrastructure-related issues requires tight coordination among all internal departments, as well as external players like the internet service provider (ISP). Organizations spend millions to make sure that the best technical bench strength is maintained and protected. The ability of an IT team to bring back normality after an outage is critical, and organizations focus heavily on these aspects. Network monitoring. Monitoring network traffic between end users and the application is important to ensure that bottlenecks are identified. This means that there is a need to have an end-to-end view of the digital delivery of applications and services over the internet. The most critical part of any cloud hosting is the internet connection. The internet can be a single point of failure, no matter how well you set up the capacity and how much money was spent on building the cloud infrastructure.
Database and intermediate-layer monitoring. Today, organizations depend on databases and other middleware and customer relationship management (CRM) solutions to run their business and perform transactions. It is important to enable monitoring and reporting of the DB, middleware, and business-to-business transactions to make sure that the end-to-end picture of your application ecosystem is properly tracked and reported. While most DB monitoring tools can generate alerts and reports on DB performance or failures, it is good to deploy tools that can also do a deep inspection into the root causes of the error and allow for troubleshooting.
Application Monitoring

According to industry research, by 2025, 50% of new cloud-native application monitoring will use open-source instrumentation instead of vendor-specific agents for improved interoperability, up from 5% in 2019. Application telemetry and deep application monitoring are the new norm regardless of the location of the application (whether in the cloud or an on-prem data center). Traditionally, application monitoring was only an extension of monitoring application health, like uptime, exception logging, and response time. With the onset of newer technologies like AppDynamics and Splunk, monitoring really has gone deep into the roots of the application, including all the vitals at the API and decision tree levels. While talking about applications, we also need to consider one more important factor, which is the database. Most of the market players (both CSPs and other types of companies) come with built-in DB monitoring and knowledge. Intellectual property in the market has very much matured when it comes to traditional players like Oracle, SAP, and NoSQL. Having said that, the onset of numerous open-source DBs and the SaaS offerings of such DBs makes this a new challenge for DB monitoring. Most of the time, the performance of an application and of a DB can be mistaken as being synonymous. There are two types of application monitoring: Proactive (or live) monitoring: This can be termed streaming telemetry, where the probes within the application can indicate all the parameters as they happen. AppDynamics is the best example of this type of monitoring, which can send proactive alerts from the dynamic probes that it inserts into the application. Reactive (or log-based) monitoring: Properly written applications generate logs, which can be very useful in tracing the source of an issue, even at a later point. Popular tools like Splunk can generate very useful insights from logs collected from the applications, thereby enabling a proper root-cause analysis.
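As a bare-bones illustration of reactive, log-based monitoring, the following sketch scans an application log for error signatures and reports an error rate; the log path, pattern, and threshold are assumptions for illustration only.

```python
import re
from collections import Counter

# Minimal log-based (reactive) check: scan an application log for error
# signatures and report an error rate that could feed an alerting rule.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception)\b")

def error_rate(log_path: str) -> float:
    counts = Counter()
    with open(log_path) as handle:
        for line in handle:
            counts["total"] += 1
            if ERROR_PATTERN.search(line):
                counts["errors"] += 1
    return counts["errors"] / counts["total"] if counts["total"] else 0.0

rate = error_rate("/var/log/myapp/application.log")  # hypothetical path
if rate > 0.02:  # alert when more than 2% of log lines are errors
    print(f"Error rate {rate:.1%} exceeds threshold; trigger an incident")
```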
Application Performance Monitoring (APM)

Application performance monitoring (APM) is similar to a computed tomography (CT) scan in medicine; it examines all aspects of your application and the associated environment. APM is not an afterthought; rather, it should be adopted at the time of code development itself. Applying APM principles and tools will allow a developer to insert the right probes in places where it matters and to trace the factors that influence performance and scalability at the time of development. The field of APM has gained prominence over the last couple of decades due to the realization of a direct relationship between the applications and the future of a business. It is essential to consider factors like programming language support, cloud adaptability and pricing, and open standard connectors as you choose APM tools. There are several industry leaders that offer a variety of services, like AppDynamics, Scout, AppInsight, Steel Central, and Stackify Retrace. Which one is used depends on the programming language. Some of the interesting features they provide include the following:
• Memory leak detection
• Stack trace
• DB query analysis
• Latency and error-based alerts
• Language extension support
• Apdex scores to measure end user satisfaction
• Middleware and messaging component monitoring
• Cloud insights
• Dashboards and metrics to track and trace business, user, and application language indicators
End User Monitoring (EUM)

With digitization, end user experiences are very influential over the adoption of services today. Any impact on users' adoption of a feature or functionality due to bad user experience design, application rendering time, or mobile or browser compatibility issues can affect a company's bottom line. End user monitoring (EUM) can provide end-to-end visibility of web and mobile applications and monitor and troubleshoot slow web/mobile network requests. Due to the widespread use of various devices and sensors at users' sites, monitoring and measuring the end user experience is difficult.
Generally, EUM can establish the relationship between server performance and what the user experiences at the interface. It also needs to consider the impact of APIs and third-party plug-ins as part of the overall experience. User interface and user experience are often conflated. User experience is a combination of several factors, and the user interface is only one of them. Measuring the end user experience is not about the user interface alone; rather, it also involves the time it takes for information to travel from the user interface to the data center and then back to the user. Either way, if the experience is not good, the user is going to try some alternative to your services. Therefore, it is an important part of your overall strategy to keep a close eye on the quality of the user experience. With EUM, you can gain various types of visibility:
• Geographically, where there may be heavy and slow application loads
• Performance variance by end device, location, operating system, browser, and application version
• CSP, carrier network, and bandwidth dependency
• The slowest Ajax/web request and the source of that request
• Application server performance and the impact of that on the end user and the mobile network
• End devices, such as IoT and other terminals, and how the signals and network affect them
End-to-End Monitoring

There are many different aspects of monitoring, starting with infrastructure, applications, the DB, the end user, the user interface and user experience, and a variety of metrics associated with them. Monitoring each of these areas will give an in-depth understanding of and visibility into those areas of applications. End-to-end monitoring is the sum of everything we have discussed so far. It is the overall experience of the ecosystem, which includes hardware, software, performance, scalability, and user experience. Until recently, there was no single tool or vendor that could provide such an end-to-end view. There were specific areas on which they focused, but not comprehensive coverage. With the arrival of vendors like AppDynamics and Splunk, however, end-to-end monitoring is a booming business today. There will be more about this later in this chapter.
Cloud Monitoring Solutions

The majority of large enterprises still revolve around on-prem hosts, data centers, or assets managed privately by outsourced or colocation models. This picture is changing rapidly as the COVID-19 pandemic encouraged an increase in remote working and virtual services, which accelerated the decision to digitize and the development of a cloud-oriented mindset. The technology landscape expanded with the onset of cloud technologies like containers, orchestration engines like Kubernetes (see Chapter 6 for more information), microservices, and serverless computing. When companies deploy applications in the cloud environment, comprehensive cloud monitoring is essential to ensure business continuity and faster issue identification and resolution. Whether it is a public, private, or hybrid cloud, keeping an eye on the performance, health, and availability of cloud services is important. It is very challenging to find optimal cloud monitoring solutions that suit the needs of your organization and provide support in identifying emerging defects and troubleshooting them before they turn out to be a major roadblock. Since each CSP differs in its monitoring definitions and SLAs, adopting open telemetry will make it easier for CSPs and users alike to collect data and get a complete picture of an application's associated assets. In the case of multicloud hosting, the host enterprise must consolidate all the necessary monitoring metrics and create a dashboard internally to keep track of the KPIs. There are well-defined monitoring solutions published by each CSP, and users can pick and choose customized monitoring for their enterprise usage along with the standard offerings. It is important to map interdependencies between applications, services, processes, and cloud components. Also, make sure that the cloud monitoring tools can provide comprehensive reports on IaaS, PaaS, DBaaS, and SaaS. While monitoring VMs and containers, keep track of metrics such as .NET CLR, thread/process count, and memory surges, along with key network statistics. For storage accounts, keep an eye on blobs, tables, files, and queues, as well as outages, by monitoring system-related alerts provided by CSPs so that appropriate actions can be initiated. For DBs, watch out for database transaction unit usage, read-write utilization, lock details, and locked queries. CSPs offer a wide range of monitoring solutions. Some of them are free; others are not. The following gives a general overview of the offerings of the most popular providers: Amazon CloudWatch: This includes a custom dashboard builder, which allows you to classify and categorize the data and metrics to provide a wide variety of analyses in a split second. In addition, it offers API
extensions that can expand and store data for future use. Amazon CloudWatch provides event-based, log-based, and metrics-based data for your account. It also allows each server to send logs, which essentially means that you can monitor even non-AWS-based server logs. This is a key differentiator and comes in very handy when you are operating in a hybrid cloud. AWS focuses on heavy automation and self-service by providing lots of choice to users, making CloudWatch one of the best cloud monitoring tools.

Google Stackdriver: Google Stackdriver allows you to pull metrics, logs, and analytics from GCP, Azure, and AWS, which is very convenient. This is a key capability, as it allows multicloud and cloud-agnostic monitoring. Stackdriver monitoring allows you to set alerts and thresholds so that it can report anomalies in the backend systems.

Azure Monitor: Azure Monitor is a very comprehensive monitoring solution that provides activity logs to track all aspects of your subscription-related activities, including details about your infrastructure. Diagnostic logs offer details about individual resources in your custody, spanning load balancers, application gateways, and others. Azure Monitor also offers metrics generation by allowing you to access the logs and monitors through representational state transfer (REST) APIs, giving the end user the power to customize the API and end-user experience. Another creative feature from Azure is its health metrics, which provide both global and local health views, so you can see the overall performance of the cloud cluster and compare it against your own provisions, common maintenance windows, patches, upgrades, and outages.
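As an illustration of how custom metrics reach a CSP's monitoring service, the following is a minimal sketch using the AWS SDK for Python (boto3) to publish a custom metric to Amazon CloudWatch. The namespace, metric name, and dimension values are hypothetical, and valid AWS credentials are assumed.

```python
import boto3

# Hypothetical example: publish a custom application metric to CloudWatch.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_data(
    Namespace="MyApp/Checkout",            # hypothetical namespace
    MetricData=[
        {
            "MetricName": "OrderLatencyMs",  # hypothetical metric name
            "Dimensions": [
                {"Name": "Environment", "Value": "production"},
            ],
            "Value": 182.0,
            "Unit": "Milliseconds",
        }
    ],
)
```

Once published, such metrics can be charted on a CloudWatch dashboard or used to drive alarms, alongside the logs and events mentioned above.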
Splunk Splunk entered the realm of APM following its acquisition of SignalFx in October 2019. SignalFx is a SaaS-only APM and infrastructure monitoring solution focused on cloud-native environments. Splunk has offerings that also cover several aspects of DevOps, along with a broad set of IT operations solutions (namely, IT Service Intelligence, VictorOps, Splunk Enterprise, Splunk App for infrastructure, and Phantom for automation). SignalFx enables distributed tracing, offering open-source agents based on OpenTracing and OpenTelemetry. Splunk’s road map for APM involves the integration of SignalFx’s real-time streaming and analytics platform and the Microservices APM product with Omnition, which together allow topology visualization based on trace data. This provides the ability to correlate microservice
behavior with underlying infrastructure by correlating metrics and traces from SignalFx with logs from the broader Splunk portfolio. SignalFx's SmartAgent can monitor the Kubernetes environment and surface informative visualizations in the Kubernetes Navigator. Infrastructure monitoring can be used to track your use of AWS, GCP, and Azure. Also, if you instrument your code, it is possible to send custom metrics of your application to SignalFx to monitor. It can collect metrics from Microsoft Windows systems, and the data can be visualized using charts and dashboards. Detectors can be configured to monitor signals on a plot line, as on a chart, and generate alerts.

Splunk APM monitors applications by collecting traces. A trace is the collection of actions that occur to complete a transaction, and each action in a trace is known as a span. APM collects and analyzes every span and trace that an application's instrumentation generates. This provides full-fidelity, infinite-cardinality exploration of the trace data an application generates, enabling you to break down and analyze the application's performance along any dimension. SmartAgent or OpenTelemetry collectors can be deployed for APM. Alerts can be configured for deviations in metrics derived from trace data. It is possible to correlate trace data with other resources in real time, including infrastructure monitoring and logging tools. Splunk provides libraries to automatically instrument applications, but APM is instrumentation-agnostic; it supports many popular instrumentation libraries for custom instrumentation, including OpenTelemetry, Jaeger, and Zipkin. A business workflow is the start-to-finish journey of the collection of traces associated with a given activity or transaction. It can be used to monitor and troubleshoot end-to-end transactions in the system. Splunk APM generates monitoring metric sets that you can use to chart performance, identify the root cause of unexpected behaviors, and resolve service bottlenecks. Global data links are dynamic resources that connect APM to a dashboard or external resource such as Splunk logs, Kibana logs, and custom URLs.

With DevOps, Agile, and the blurring boundaries of development and operations, the need for integrated quality and compliance has become paramount. Monitoring and metrics need to be defined at the time the code is written. Telemetry probes should be included inside the APIs and microservices so that every aspect of the software and applications can be monitored and measured. There are also newer techniques like synthetic monitoring (proactive monitoring, as described earlier in this chapter), which uses emulated scripts to test transactions. Synthetic monitoring techniques allow administrators to continuously monitor availability and response times using prescripted programs without the need for real-life traffic. This, along
with passive monitoring, can be an excellent combination, each complementing the other. Knowing what, when, and how to monitor will define the future of your business and the future of the cloud in general.
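Because several of the tools above accept OpenTelemetry data, a small sketch may help show what "instrumenting your code" looks like in practice. This is a minimal, hedged example using the OpenTelemetry Python API and SDK; the service and span names are hypothetical, and a real deployment would typically swap the console exporter for an OTLP exporter pointed at a collector or APM backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that prints finished spans to the console.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each "with" block produces a span; nested blocks become child spans,
# and the whole tree forms one trace for the transaction.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", "A-12345")   # hypothetical attribute
    with tracer.start_as_current_span("charge-card"):
        pass  # payment logic would go here
```

The same span data, exported through a collector, is what an APM backend assembles into traces, service maps, and the metric sets discussed above.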
CHAPTER 6
Optimized DevOps
The top five enterprises in technology today (namely, Amazon, Microsoft, Google, Facebook, and Apple) are running their businesses through the cloud. Amazon, Google, and Microsoft offer commercial cloud services to end users. Thus, it is vital to know more about the principles of using the cloud. But before we jump into that topic, we need to understand how an application, or a software program, is developed. What happens behind the scenes of a software build?
Monolithic Applications With the explosion of the internet, the possibilities of applications and the associated infrastructure and services have grown multifold. How the growth of the internet and the associated scalability and performance issues became a roadblock is an interesting jumping-off point for our journey to the cloud. Next, we will examine how the monolithic, single-machine, single-programmer era evolved and transformed into collaborative programming and the unlimited possibilities of the cloud. Traditional application development revolved around two main aspects. The first is the host system, where you develop and deploy the application, and the second is the software or language used for developing the application. The host in this case is a system with hardware and an operating system (e.g., a personal computer with Microsoft Windows or Linux). The languages used to develop the application include Java, Python, Perl, C, and others. Imagine that we are developing a small application that can keep track of the travel expenses of a group of friends. You are the developer, and you quickly create the application on your Linux personal computer using Java. You then compile it and get it working. This application will now be hosted on the same machine, and an interface to the application will be provided using a
web browser such as Google Chrome. You and your friends can enter your data and check the latest expenses of the group. The data entered will get stored in a text file on your computer. Life is so simple with this framework—you don't have to worry too much about performance, scalability, database, security, versioning, and other issues, since you are the only developer and you control your own code base. You know that your friends are the only ones who will enter data, and their names are stored in a hardcoded file for access control. You do not have testers, as you test it yourself. You are not concerned about the underlying operating system or versions, as you have all the libraries and environmental parameters stored on the same personal computer, which is also your hosting machine. You have also not planned for exposing libraries to others or pulling data from other applications. Most of the code is written in a traditional way. Your entire application goes down if any portion of the code fails, which requires you to troubleshoot and bring the application up again. You now know what we are talking about—yes, it is a monolithic application. A simple hint that you are using a monolith is the familiar story of outages at an airline booking application or a bank payment site: the entire application goes down and nothing works until someone restarts the whole stack.
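The chapter's example imagines the expense tracker written in Java; purely as an illustration (and not the authors' code), here is a minimal sketch of the same monolithic idea in Python: one process, a hardcoded access list, and a plain text file for storage, with no separation between interface, logic, and data.

```python
# Hypothetical monolithic expense tracker: everything lives in one process
# and one file on the developer's own machine.

FRIENDS = {"asha", "ben", "carlos"}      # hardcoded "access control"
DATA_FILE = "expenses.txt"               # plain text file as the "database"

def add_expense(name: str, amount: float, note: str) -> None:
    if name not in FRIENDS:
        raise PermissionError(f"{name} is not allowed to enter data")
    with open(DATA_FILE, "a") as f:
        f.write(f"{name},{amount},{note}\n")

def total_expenses() -> float:
    total = 0.0
    try:
        with open(DATA_FILE) as f:
            for line in f:
                total += float(line.split(",")[1])
    except FileNotFoundError:
        pass  # no expenses recorded yet
    return total

if __name__ == "__main__":
    add_expense("asha", 42.50, "train tickets")
    print("Group total so far:", total_expenses())
```

If any part of this single program fails, the whole application is down, and any change means rebuilding and restarting everything, which is exactly the behavior described above.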
Infrastructure and How It Evolved Monolithic applications are highly dependent on the underlying operating system and hardware. The application is one single unit; one module failure brings it all down. Any upgrade or bug fix requires the entire application to be rebuilt and restarted. Testers test the same software image, which is built only once and is often run in a common place with somewhat similar hardware and operating system configurations. Any change in either the hardware or operating system can make the software highly susceptible to failure. Moving the application to a different place (e.g., shifting data centers) requires all of the hardware to be moved as well. Any scalability or performance improvement is highly limited by the capacity of the underlying hardware or operating system. As the internet and associated technology started to mature, this started becoming a roadblock. In the 1990s, those in the field of IT were very familiar with the concept of mainframes and languages such as COBOL and Pascal. Most banking and other large-scale systems were all connected to a mainframe (such as from IBM) or similar centralized computing systems. The power of computing and storage lay in the central server, and all the end devices were just terminals and could operate without computing power. Novell Netware was a predominant player in local area networking (LAN), and it also had NetWare
servers. Novell operated on the principle of a centralized server and clients with less power but the ability to connect to a central location. With the evolution of desktop software and personal computers, the scenario started shifting. The availability of cheaper storage, more memory, and advancements in computer chips made production of personal computers with greater capacity possible. The next challenge was: how do we connect these personal computers over a network and achieve even more power? Wide area networking (WAN) was the answer. Suddenly, personal computers became more powerful, as they could be connected by a local network, or they could be geographically distant and connected by a WAN, in what is known as distributed computing. Hardware companies (e.g., IBM, Dell, and Hewlett-Packard), along with chip makers (e.g., Intel and AMD), started creating powerful machines cheaply and on a smaller scale that nearly matched the power of the room-size servers of the 1990s. Fast forward to the 2010s: several internet providers created distributed computing assets connected via cloud servers and started moving past the mainframes. Novell faded from the market as Microsoft gained nearly the entire personal computer software market, thanks mostly to its Windows operating system. In the first decade of the new millennium, Apple entered the market with its App Store, Google became a trademarked synonym for search, and more and more applications and software were created over the internet. In 2008, Cisco announced a major decision: it would focus on a horizontal expansion of its data center switching business to include so-called blade servers. The idea was simple: while you connected to a data center with a Cisco switch, the same switch could offer computing and storage with much less use of real estate, power, and heat emissions. To everybody's surprise, Cisco's blade server business, called unified computing systems (UCS), became a billion-dollar platform and soon was an industry standard. Today, almost all cloud infrastructures use the blade server concept. In the meantime, companies like Intel and AMD were partnering with network makers on networking and computing capacity, which could boost both bandwidth and speed. They were trying to increase throughput by achieving more efficiency over the WAN. Kilobits per second (Kbps), a luxury in the 1990s, soon was replaced with megabits per second (Mbps), gigabits per second (Gbps), and eventually terabits per second (Tbps). Cisco's award-winning router, the gigabit switch router (GSR), could send any packet from any port to any other port at lightning speed, which essentially meant that you could almost achieve the computing and storage power of your personal computer using a virtual resource over the internet. Cisco started developing ports on
their devices that could support 10 Gbps (10 gig, as it was called), which soon were upgraded to 40 gig, 100 gig, and even 400 gig. Cisco devices hosted several ports in the same chassis, which essentially meant that one switch or router could handle hundreds of such parallel gigabit transactions at wire speed, a term used to describe the transfer of a packet from one port of the switch to another without any added latency. You may wonder at this point where these computing and networking stories are heading. They laid the foundation for what we call the cloud today. In simple terms, the cloud is the result of combining all the previous technology into one package offered by players like Amazon, Google, and Microsoft, among many others. During the first decade of the millennium, none of these companies were even thinking about the cloud as a service. Amazon dealt in selling books online; Google was focused on being a search engine; and Microsoft specialized in personal computing. For companies like Microsoft, the cloud was actually a threat at first, as they thought it could challenge their PC market and take the whole world back to where we started with the IBM mainframe (because the cloud proposed to centralize computing, storage, and networking, thereby avoiding the need for high-end personal computers). But what we witnessed in the second decade of the 21st century is a remarkable transition; these three players tapped into what Cisco, Intel, and other pioneers created with the network computer and made it the basis for the cloud.
High Availability of Applications and Infrastructure As the hardware evolved, started scaling beyond the walls of a building, and spread across international borders, the industry started focusing on application availability. Continuity of critical services and uninterrupted experiences soon became the norm. The concept called 99.999% availability was often considered the desired state in SLAs. CSPs started making great strides toward a notion called high availability, or always available.
The Process Layer—Waterfall, Agile, and CI/CD The application development started growing from one developer environment into a larger ecosystem, where multiple programmers and testers came together to write software. This collaborative effort required processes and rules agreed to by everyone, and the industry adapted the model used in the factory lines, called the waterfall method. In this method, developers and testers get the full picture of the product toward the end of the production cycle. This means that they may have to fix hundreds of defects, in a highly laborious process called a bug scrub, and work overtime to move them to a
fixed, duplicate, or junk state based on the release criteria. Unless planned well, this can have a negative impact on delivery timelines and the overall quality of the product. Industry experts started thinking of a better way to achieve efficiency and make the process faster and more predictable, avoiding the last-minute rush. One of the fundamental obstacles was the way that programming was done and the way that teams were organized around the development methodology. In traditional application development, the entire application is considered as one unit, and everyone contributes to the entire application code base, which is centrally maintained by a version control team with tools such as ClearCase or CVS, a version control system used to manage file versions. Every developer used the same code base and committed code to the same branch. The developers had to wait for others to complete and test the entire code at once. There are several branching and merging strategies available with waterfall for achieving parallel development; however, it still needs additional effort compared to Agile. Many companies have found the approach inefficient and adopted modular programming, along with Agile and microservices, to improve it. While Agile gives speed and flexibility, it should not be assumed that it gives inherent quality to its products. Waterfall offers excellent quality and consistency, so it should be used when quality is the prime concern. When the first few versions of a software program are more focused on proving a specific hypothesis than on quality, waterfall may be too much of a burden. Most systems-level programming and hardware areas still use the waterfall approach, as it requires extremely high quality and consistency at every step. As a word of caution, Agile may create levels of uncertainty, as there may be a poorly defined scope in the name of speed and a large number of conflicting priorities from various parties in the name of flexibility. If not managed well, it can bring down the overall efficiency of the process.
The Agile Methodology—A New Method to Divide and Conquer Traditional or classic development methods (waterfall) offered a well-planned way to meet customer requirements through planning, development, testing, deployment, and maintenance. This model was mostly inspired by assembly production lines, which followed a very methodical movement of items from one line to the next. The hard handoff, or manual handoff, made it clear that the responsibility was moving from one person to the next. The most prevalent features of this model were the process and checklist during every handoff and the lack of immediate feedback. The testers had to wait for the entire product to be ready before they started testing,
the build engineers had to wait for every branch to be merged before they could start, and the release engineers needed all the software to be prepared, tested, and scrubbed before they could do their work. The release windows were preannounced, and if you missed one, you'd have to wait for the next one. The only times the software could be put out to production were once per month or during a major release window. This made the development cycle longer and pushed defect discovery toward the end of the product life cycle; as a result, defects started reaching the customers. Another issue was the process overhead created by having more people at the project management office managing the developers and testers and making them report largely repetitive metrics. Customers had to wait a long time to have their problems fixed. Overall, it was becoming too much of an issue, and people started thinking of a better, faster, and easier way of developing software and building applications. That opened up a new era of continuous development, which moved the cycles in parallel and in a more iterative way. There were process changes, methodology changes, and cultural changes. At the same time, there was new thinking about more robust and faster development cycles, which resulted in what is known today as Agile development. Figures 6.1 and 6.2 represent the classical development model, with its sequential process, and the waterfall method, an enhanced version of the classical model.

Agile development methodologies address some of the perennial problems of the waterfall model. Agile collapsed the wall between the various stakeholders involved in the process. Instead of separate teams involved in development, test, release, and the other steps in the process, Agile created a virtual team called the scrum team. While scrum addressed the team structure primarily, it also eliminated several handoffs and started creating dynamic teams and a process that made it easy for everyone to collaborate, thus making the product more consistent. The data and metrics were made available in the scrum room, providing instantaneous feedback and giving the project team real-time information.
Figure 6.1 The classic development cycle.
Figure 6.2 Waterfall development.
The one-dimensional, one-directional movement of artifacts in waterfall became two-directional in Agile, as issues found in a later phase could come back to the design phase until they were resolved. This method helped collapse the time window for releases.
How Does the Cloud Enable Us to Transform DevOps? DevOps (short for "Development and Operations") is a concept that did not originate with the cloud. Rather, it had already created a general playground for application developers and infrastructure providers to come together and make meaningful adjustments that keep the process smooth and measurable. With the operations team joining hands with application developers and infrastructure providers, end-to-end application development, deployment, hosting, and support become part of one common process.
DevOps Basics While Agile paved the path of process and methodologies, DevOps bridged the gap with continuous integration (CI), continuous delivery (CD), and continuous deployment. DevOps improves the interactions between the development and operational sides of the enterprise and brings them together in a well-orchestrated process and tool chain. CI integrates code into a central repository as it is developed, allowing others to test it and fix any defects. It lets you left-shift the defect cycle—a term used to describe the process of finding defects early in
the development cycle and hence fixing them quickly and improving time to market. CD allows you to put your code into a testing environment and accelerate the testing and fixing cycles. It opens up the lines of communication between the testers and developers by removing the wall between them. Continuous deployment is an end-to-end process and tool chain that allows the application ecosystem to develop and deploy software in a continuous loop in a fully automated manner. The software build can get deployed automatically to a production environment almost immediately, accelerating the development life cycle considerably. Figure 6.3 demonstrates the CI/CD process.

Continuous testing and continuous regression (CR) are other great additions to this ecosystem. As and when new code gets added to the mainstream, or there are any changes to the existing code base, mechanisms trigger various types of testing automatically. CR can be a great enabler for limiting the collateral damage created by new code, thereby ensuring backward compatibility and stability for the code in production. The steps from unit testing all the way through the automated tool chain to regression testing can be implemented. Many organizations have end-to-end automation for managing the entire CI/CD pipeline, with testing, compliance, and security at every level of the release process. Some commercial CSPs and other companies offer this as a service, and the same is available as SaaS in the cloud as well.
Figure 6.3 CI and continuous deployment.
DevSecOps DevSecOps (short for "development security operations") is an extension of the DevOps concept with more focus on the quality and security aspects of Agile software development methodologies. With Agile and DevOps expanding their reach beyond on-prem offerings, the focus suddenly shifted to security. One of the earlier limitations of the cloud, which almost made people turn away from it in the early stages, was security concerns. With more applications moving to the cloud and becoming accessible over the internet, the vulnerability of each application and its associated data increased. Security testing has become a cornerstone of cloud application development. In the overall CI/CD chain, security has become an integral part (hence the name DevSecOps). In summary, successful implementation of DevOps and CI/CD involves several factors and tool chains, all connected and communicating with each other. These tool chains help to reduce manual handoffs and errors, facilitate team interaction, and allow the team to scale beyond the boundaries of geography and infrastructure. Figure 6.4 depicts a standard view of the CI/CD automated pipeline, while Figures 6.5 and 6.6 take a closer look at what happens inside an Agile process. As we go deep into the process, we can see that the structure allows multiple iterative cycles called sprints, with daily scrum reviews. After several of these reviews and iterative development, each sprint produces something called a minimum viable product (MVP), which is a working version of the software. The MVP is not the final version, but it can stand by itself and give an early view of how the final product will be, thereby allowing stakeholders to provide early feedback. A good example of a prominent player for the Agile system in the market is IBM, which was a leader in the waterfall methodology with ClearCase and its Rational suite of software. IBM developed Agile and DevOps solutions such as IBM UrbanCode, which helps Agile development teams manage, build, and deploy software in an automated way. The best part of the UrbanCode solution is that it can work for either on-prem or cloud deployment and makes CI/CD more efficient and effective.
The New Application Architecture There are three elements to emphasize:
• Modular programming and application programming interfaces (APIs)
• Service-oriented architecture (SOA)
• Microservice architecture

While Agile and waterfall are concepts of the software life cycle, various technologies and methods are used across this spectrum. APIs, microservices, and SOAs offer specific means of achieving the various stages of the life cycle.

Figure 6.4 Agile with CI/CD development.
Figure 6.5 Agile with CI/CD development.
Figure 6.6 Agile with CI/CD development: The sprint.
Modular Programming and Application Programming Interfaces Modular programming is a way of developing software in which you design each module of the program to be independent and self-sustaining. In other words, you divide your entire software into modularized functions, and each function can execute independently while being loosely coupled to the main program. In this way, it gives good flexibility and
autonomy to the ecosystem. Since each module is independent, other programs with a similar requirement can reuse these modules. It also allows you to debug and monitor the modules more efficiently. In this way, software development can be distributed to thousands of loosely connected individuals who are spread across continents but come together to achieve a common goal. API is a term often used to represent the connection between multiple modules or functions. An API helps to connect those loosely coupled modules through well-defined inputs and outputs. Consider it a gateway between two independent applications, modules, or functions. SOA is the principle, or command center, that defines what can be shared by an API. It defines specific sets of data and output that can be consumed by another application. It also makes sure that this information is accessible over the internet for use from anywhere in the world. SOA defines a number of principles around sharing data and protocols with everyone in the ecosystem. You may also come across the term REST API while dealing with these topics. REST stands for "representational state transfer," a term often used in web-based programming. Keep in mind that REST is a set of rules and constraints that a developer follows while exposing data, which allows others to fetch it through something often called a web service.
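To make these ideas concrete, here is a hedged sketch (not from the book) of a small module exposed as a REST API using Flask, plus a second program that consumes it over HTTP. The route, port, and data are hypothetical; the point is simply that the two programs share nothing but a well-defined input/output contract.

```python
# service.py - a small, self-contained module exposed through a REST API.
from flask import Flask, jsonify, request

app = Flask(__name__)
expenses = []  # in-memory store, just for illustration

@app.route("/expenses", methods=["GET"])
def list_expenses():
    return jsonify(expenses)

@app.route("/expenses", methods=["POST"])
def add_expense():
    expenses.append(request.get_json())
    return jsonify({"count": len(expenses)}), 201

if __name__ == "__main__":
    app.run(port=5000)
```

```python
# client.py - a separate program that only knows the API contract.
import requests

requests.post("http://localhost:5000/expenses",
              json={"name": "asha", "amount": 42.5})
print(requests.get("http://localhost:5000/expenses").json())
```

Any other module, written in any language, could call the same endpoints; that is the loose coupling SOA and REST are aiming for.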
Microservice Architecture Microservice architecture is a leap forward from the monolithic framework. It divides a large application or software into smaller chunks of independent, loosely coupled services, each with a self-contained code base. As opposed to the monolith, application developers are not bound to use the same language or to build modules based on the same framework. They can independently update and fix bugs in their service without affecting the overall software. This flexibility is a great enabler for cloud-native programming and cloud computing, as the underlying resources can be independently allocated to a service rather than to the entire application. Each service can be bundled independently as a package that can be deployed anywhere. As we explore the microservice architecture further, you can see great similarities to the SOA and API principles discussed earlier. APIs are an inherent part of this method, and microservices follow the principles of SOA. An additional benefit is that different independent services can bundle together to meet a common business objective and can be deployed anywhere. Microservices connect the programming and underlying infrastructure and define a package that can coexist but is loosely coupled. To make
the best use of microservice architecture, we will discuss the cloud-native architecture next.
Cloud Native: Application Architecture for the Cloud We have just discussed the Agile methodologies, SOA, CI, CD, DevSecOps, microservices, and the API framework. When we combine all these principles and apply them to our software development, it is called cloud native architecture. Cloud native is a generic term used to describe the architecture, enterprise principles, agility, and use of technologies and methodologies. It can allow an application to use the principles of the cloud and cloud-enabled business models. Cloud-native systems recommend that you follow certain basic principles:
• Use microservice architecture.
• Use containerization or the principle of modularization.
• Make your application loosely coupled and self-reliant.
• Make your design hardware independent (see the sketch after this list).
• Make it deployable anywhere.
• Provide elasticity and the ability to use the cloud's computing facilities.
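As one small, hedged illustration of the "hardware independent" and "deployable anywhere" principles (our example, not the book's), cloud-native services typically read their settings from the environment instead of hardcoding machine-specific paths or hosts, so the same container image can run unchanged on a laptop, a VM, or any cloud.

```python
import os

# Hypothetical settings, resolved from the environment at startup.
# The same image runs anywhere; only the injected environment differs.
DB_URL = os.environ.get("DB_URL", "sqlite:///local-dev.db")
PORT = int(os.environ.get("PORT", "8080"))
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")

print(f"Starting service on port {PORT} with DB {DB_URL} (log level {LOG_LEVEL})")
```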
Automated Cloud Deployment and Building Blocks We have already examined the transition from monolithic software development to a more nimble, Agile software development process and how it shapes our decisions on the cloud and associated deployment frequencies. Another interesting aspect of the cloud is the hardware and associated ecosystems. Even though the cloud is an easy way of building, deploying, and maintaining software and applications, the cloud itself is a set of interconnected hardware. Anyone who wants to become a cloud-native programmer must understand the real underlying construct of the cloud from a hardware and infrastructure point of view. In the following section, we will consider the evolution of the hardware and networking that support the cloud environment. There are four basic constructs in the cloud: public, private, community, and hybrid. While we examine the cloud deployment models, we will also delve into some of the basic framework through which these cloud deployment models evolved. There have been groundbreaking innovations in hardware and networking technologies, which allowed the industry to virtualize and connect everything over the internet. Cloud computing and cloud assets have
evolved over the last two decades as a result. We will go over these principles of hardware next, by discussing topics like virtualization, containerization, docker, and Kubernetes.
Virtualization Virtualization allows you to create a virtual platform on top of BareMetal (a term used for physical hardware), giving you the flexibility of configuring it to the needs of your application and its operating system. Using virtualization, you can run multiple VMs on top of one physical machine, each of which can run your choice of operating system. As a matter of fact, you can run both the Windows and Linux operating systems on the same system, where the underlying hardware is the same but is running two VMs. Virtualization is not only applicable to hardware or systems. Generally, there are four types of virtualization: network virtualization, desktop virtualization, storage virtualization, and application virtualization. With the virtualization of resources, opportunities are created to effectively utilize the resources with more flexibility. The software layers can be successfully separated from the underlying hardware. There are several principles that make virtualization possible: A VM is a simulated computer that uses the underlying resources of a physical machine but acts as an independent unit for an application running on it. The main purpose is to provide an isolated environment for each application by creating an environment that is more user friendly and scalable. It allows you to share underlying hardware with multiple operating systems. Normally, only one operating system can run on one physical device; VMs allow each operating system to have a separate kernel. Hypervisor is a term often used in the context of virtualization and VMs. Stated simply, it is an agent that facilitates the creation and management of VMs and allows for sharing of resources—compute, storage, and memory. The concept of a container is almost the same as a VM except that a container does not have its own kernel. A container shares the underlying kernel of the host system. Both containers and VMs can hold an application independently and operate out of the same underlying hardware. However, each VM has its own operating system, whereas containers share an underlying operating system. Containers use the underlying operating system through an agent called
docker (discussed next). There are other similar products, but we will use docker as the example in this chapter. Similar to the role that the hypervisor plays with VMs, docker pairs up with containers to create a package on which software and applications can be hosted. Docker is a Linux container agent that uses Linux kernel features like control groups (cgroups) and namespaces to create containers on top of the operating system. The main benefit of docker is that the burden of the operating system and kernel is moved out of the application developer's hands and handled by the middle layer. All an application developer needs to worry about is the binaries and libraries, not the operating system itself. In the case of VMs, the operating system itself was part of the machine, and that made it another part of the application developer's responsibility. Docker removed that worry of maintaining a separate kernel for each application. With docker gaining momentum, and because most of its properties and libraries are universally available as open source, applications developed in one environment can be easily transported to any other machine. If there is a docker agent between the application and the underlying kernel, you can pretty much stop worrying about hardware. That is a big shift from the traditional world of computing and deployment. Figure 6.7 demonstrates the principles that we've covered so far on virtualization and containerization. With this model, you can run many applications on the same underlying hardware, but they operate in their own defined worlds. This makes each application independent, so it does not overconsume resources or strain other applications. Each VM or container is allocated the resources it can claim, and thresholds are monitored to detect when it exceeds them. With the cloud, we will see the elasticity of resources for VMs and containers, where applications can claim additional power depending on demand. With the availability of VM/hypervisor and container/docker, the possibilities of application development moved from a single-system focus to a virtual world with an "anytime, anywhere" mode of application hosting. Developers can focus on the application rather than the underlying operating system and hardware. Programmers can now spend more time on what matters most for the application experience rather than worrying about the infrastructure. Looked at closely, docker is both a lightweight alternative to a VM and a software packaging and delivery platform. Application development is becoming easier thanks to these concepts, and people started to look for more scalability.
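As a hedged illustration of what handing the kernel worry to the middle layer looks like in practice, the following sketch uses the Docker SDK for Python to start a containerized web server; it assumes the docker daemon is running locally and that the SDK (the docker package) is installed. The image and port mapping are our own choices, not the book's.

```python
import docker  # Docker SDK for Python

client = docker.from_env()  # talk to the local docker daemon

# Run an off-the-shelf image in the background, mapping container port 80
# to port 8080 on the host. The host's kernel is shared; no guest OS is needed.
container = client.containers.run(
    "nginx:alpine",
    detach=True,
    ports={"80/tcp": 8080},
)

print("Started container:", container.short_id)
print(container.logs(tail=5).decode(errors="ignore"))

container.stop()
container.remove()
```

The same image, unchanged, could be run on any other machine with a docker agent, which is the portability point made above.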
Figure 6.7 VMs and containers.
Kubernetes Kubernetes is an open-source container orchestration platform that can help put containers into logical groups. What, exactly, is container orchestration? Think of a situation where you need more computing, storage, and memory than one machine can offer alone. What options do you have? Logically, you think of buying another machine and connecting it to your computer to combine their power. Practically, however, it is not that simple to arrange, and the complexities involved are far more than an application developer can handle. The other option is to extend the VMs or containers to the second machine, keeping the underlying hardware connected somehow. This option sounds interesting and much better than connecting two machines with wires and making the operating systems and kernels communicate with each other. Kubernetes comes in handy here. It can simply act as an agent that connects multiple containers into a cluster, no matter which host they come from. Suddenly, we are free of the limitations of a single-host environment and have greater elasticity with resources.

Let us now explain the basic framework of Kubernetes. It can connect multiple VMs or even BareMetal hosts by treating them as nodes, which the Kubernetes master manages as a cluster. Inside the cluster there are pods; a pod is nothing but a logical collection of one or more containers that represent an application ecosystem. Consider a pod as a single unit of deployment. The kubelet is the node agent that handles health checks and other scheduling activities for its node on behalf of the master. There is a docker engine present on each node, which manages the containers for that node. Overall, Kubernetes extends the docker concept from one machine to multiple machines. Figure 6.8 provides a sample representation of this concept.

Figure 6.8 Kubernetes basics.

A single monolithic application hosted on a dedicated Windows or Unix machine was the norm back in 2000; over the next 20 years, it has given way to a cluster of VMs that can be anywhere, managed by someone for you and scaled up and down with whatever resources you demand. To summarize this discussion, techniques such as virtualization, containers, docker, hypervisors, and Kubernetes have reshaped application development ecosystems into a virtual world with unlimited possibilities. These technologies have broken the boundaries of physical machines and the limitations and constraints of resources by opening the software onto a universally available platform. The application is developed using microservice principles and an Agile methodology using the CI/CD pipeline. In Figure 6.9, the following is demonstrated as Kubernetes is employed as a central engine:
• Code repository changes will trigger Jenkins (a utility to trigger and manage software builds) in order to build and transmit docker images to the docker registry.
• A standard Kubernetes cluster will have one or more master nodes and many worker nodes.
• Developers use kubectl to interact with a Kubernetes cluster.
• Each worker node may have one or more pods, and each pod can run one or more containers hosting application images.
• The Kubernetes master node communicates with worker nodes to deploy the latest containerized images based on the configuration.
• The Kubernetes master node keeps monitoring the worker nodes and ensures the availability of pods and deployed application images.
• Amazon Elastic Kubernetes Service (EKS) is a managed AWS service for running Kubernetes clusters on AWS. AWS CloudFormation can be used to launch a cluster of AWS EC2 worker nodes. Based on the configuration file, the EKS control plane communicates with the worker nodes and launches containerized application images. The control plane monitors and manages the worker nodes for high availability.

Figure 6.9 Deployment with Kubernetes as a central engine.
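For readers who prefer code to kubectl, the following hedged sketch uses the official Kubernetes client library for Python to ask the cluster's API server the same kind of question kubectl answers; it assumes a kubeconfig file is already set up for the cluster (for EKS, typically generated with the AWS CLI).

```python
from kubernetes import client, config

# Load credentials and the cluster address from the local kubeconfig file.
config.load_kube_config()

v1 = client.CoreV1Api()

# List every pod the API server knows about, across all namespaces,
# roughly equivalent to `kubectl get pods --all-namespaces`.
pods = v1.list_pod_for_all_namespaces(watch=False)
for pod in pods.items:
    print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)
```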
You can see from Figure 6.10 that these clusters are capable of autoscaling based on the load factor, with built-in redundancy by design. The model shown here is based on AWS deployment, and similar modes are available with GCP and Azure.

Figure 6.10 Deployment with Kubernetes: AWS example.
Application Deployment Frequency Up to now, we have discussed the methods of DevOps and CI/CD and also browsed through the principles of modular programming. Once the software is developed and various programmers commit their code to a common repository, the next action is bundling various modules of the software and making it a usable package. This is known as the build and deployment cycle. As pointed out earlier, the monolithic and waterfall approaches treated build and deployment in a serial mode. With CI/CD gaining prominence, build and deployment gain more momentum, making it possible to do them at any time. Even though the deployment of a software program can happen anytime in theory, most enterprises follow a certain frequency pattern for doing it because of the factors influencing the user experience, market dynamics, and continuous deployment. The industry refers to this as deployment frequency, a term that is all about how soon and how fast your code can get deployed for testing and production. Practically, you can deploy your code anytime with the CI/CD model, but the frequency is really decided based on the type of code commitment and the stages of your projects.
Deployment Behavior and Frequency

• On demand: CI/CD fully in practice; as-needed deployment; test images. Mean time to change (MTTC): less than a day.
• Nightly: Sync and merge dynamically, but deployment once a day. MTTC: less than a week.
• Weekly: CI/CD followed, but deployment only once a week. MTTC: less than a month.
• Monthly: Once a month to once in six months; major deployments. MTTC: one month to six months.
MTTR –