Machine Learning with the Elastic Stack Second Edition Gain valuable insights from your data with Elastic Stack's machine learning features
Rich Collier Camilla Montonen Bahaaldine Azarmi
BIRMINGHAM—MUMBAI
Machine Learning with the Elastic Stack, Second Edition

Copyright © 2021 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Kunal Parikh
Publishing Product Manager: Devika Battike
Senior Editor: David Sugarman
Content Development Editor: Joseph Sunil
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Nair
Proofreader: Safis Editing
Indexer: Manju Arasan
Production Designer: Alishon Mendonca

First published: January 2019
Second edition published: May 2021
Production reference: 1270521

Published by Packt Publishing Ltd.
Livery Place, 35 Livery Street, Birmingham B3 2PB, UK.

ISBN 978-1-80107-003-4
www.packt.com
Contributors

About the authors

Rich Collier is a solutions architect at Elastic. Joining the Elastic team from the Prelert acquisition, Rich has over 20 years of experience as a solutions architect and pre-sales systems engineer for software, hardware, and service-based solutions. Rich's technical specialties include big data analytics, machine learning, anomaly detection, threat detection, security operations, application performance management, web applications, and contact center technologies. Rich is based in Boston, Massachusetts.

Camilla Montonen is a senior machine learning engineer at Elastic.

Bahaaldine Azarmi, or Baha for short, is a solutions architect at Elastic. Prior to this position, Baha co-founded ReachFive, a marketing data platform focused on user behavior and social analytics. Baha also worked for different software vendors such as Talend and Oracle, where he held solutions architect and architect positions. Before Machine Learning with the Elastic Stack, Baha authored books including Learning Kibana 5.0, Scalable Big Data Architecture, and Talend for Big Data. Baha is based in Paris and has an MSc in computer science from Polytech Paris.
About the reviewers Apoorva Joshi is currently a security data scientist at Elastic (previously Elasticsearch) where she works on using machine learning for malware detection on endpoints. Prior to Elastic, she was a research scientist at FireEye where she applied machine learning to problems in email security. She has a diverse engineering background with a bachelor's in electrical engineering and a master's in computer engineering (with a machine learning focus).
Lijuan Zhong is an experienced Elastic and cloud engineer. She has a master's degree in information technology and nearly 20 years of working experience in IT and telecom, and is now working with Elastic's major partner in Sweden, Netnordic. She began her journey with Elastic in 2019 and became an Elastic certified engineer. She has also completed the machine learning course from Stanford University. She has led many Elastic and machine learning POCs and projects, and customers have been extremely satisfied with the outcomes. She has been the co-organizer of the Elastic Stockholm meetup since 2020. She took part in the Elastic community conference 2021, where she gave a talk about machine learning with the Elastic Stack. She was awarded the Elastic bronze contributor award in 2021.
Table of Contents

Preface

Section 1 – Getting Started with Machine Learning with Elastic Stack

Chapter 1, Machine Learning for IT
  Overcoming the historical challenges in IT
  Dealing with the plethora of data
  The advent of automated anomaly detection
  Unsupervised versus supervised ML
  Using unsupervised ML for anomaly detection
    Defining unusual
    Learning what's normal
    Probability models
    Learning the models
    De-trending
    Scoring of unusualness
    The element of time
  Applying supervised ML to data frame analytics
    The process of supervised learning
  Summary

Chapter 2, Enabling and Operationalization
  Technical requirements
  Enabling Elastic ML features
    Enabling ML on a self-managed cluster
    Enabling ML in the cloud – Elasticsearch Service
  Understanding operationalization
    ML nodes
    Jobs
    Bucketing data in a time series analysis
    Feeding data to Elastic ML
    The supporting indices
    Anomaly detection orchestration
    Anomaly detection model snapshots
  Summary

Section 2 – Time Series Analysis – Anomaly Detection and Forecasting

Chapter 3, Anomaly Detection
  Technical requirements
  Elastic ML job types
  Dissecting the detector
    The function
    The field
    The partition field
    The by field
    The over field
    The "formula"
    Exploring the count functions
    Other counting functions
  Detecting changes in metric values
    Metric functions
  Understanding the advanced detector functions
    rare
    Frequency rare
    Information content
    Geographic
    Time
  Splitting analysis along categorical features
    Setting the split field
    The difference between splitting using partition and by_field
  Understanding temporal versus population analysis
  Categorization analysis of unstructured messages
    Types of messages that are good candidates for categorization
    The process used by categorization
    Analyzing the categories
    Categorization job example
    When to avoid using categorization
  Managing Elastic ML via the API
  Summary

Chapter 4, Forecasting
  Technical requirements
  Contrasting forecasting with prophesying
  Forecasting use cases
  Forecasting theory of operation
  Single time series forecasting
  Looking at forecast results
  Multiple time series forecasting
  Summary

Chapter 5, Interpreting Results
  Technical requirements
  Viewing the Elastic ML results index
  Anomaly scores
    Bucket-level scoring
    Normalization
    Influencer-level scoring
    Influencers
    Record-level scoring
  Results index schema details
    Bucket results
    Record results
    Influencer results
  Multi-bucket anomalies
    Multi-bucket anomaly example
    Multi-bucket scoring
  Forecast results
    Querying for forecast results
  Results API
    Results API endpoints
    Getting the overall buckets API
    Getting the categories API
  Custom dashboards and Canvas workpads
    Dashboard "embeddables"
    Anomalies as annotations in TSVB
    Customizing Canvas workpads
  Summary

Chapter 6, Alerting on ML Analysis
  Technical requirements
  Understanding alerting concepts
    Anomalies are not necessarily alerts
    In real-time alerting, timing matters
  Building alerts from the ML UI
    Defining sample anomaly detection jobs
    Creating alerts against the sample jobs
    Simulating some real-time anomalous behavior
    Receiving and reviewing the alerts
  Creating an alert with a watch
    Understanding the anatomy of the legacy default ML watch
    Custom watches can offer some unique functionality
  Summary

Chapter 7, AIOps and Root Cause Analysis
  Technical requirements
  Demystifying the term "AIOps"
  Understanding the importance and limitations of KPIs
  Moving beyond KPIs
  Organizing data for better analysis
    Custom queries for anomaly detection datafeeds
    Data enrichment on ingest
  Leveraging the contextual information
    Analysis splits
    Statistical influencers
  Bringing it all together for RCA
    Outage background
    Correlation and shared influencers
  Summary

Chapter 8, Anomaly Detection in Other Elastic Stack Apps
  Technical requirements
  Anomaly detection in Elastic APM
    Enabling anomaly detection for APM
    Viewing the anomaly detection job results in the APM UI
    Creating ML jobs via the data recognizer
  Anomaly detection in the Logs app
    Log categories
    Log anomalies
  Anomaly detection in the Metrics app
  Anomaly detection in the Uptime app
  Anomaly detection in the Elastic Security app
    Prebuilt anomaly detection jobs
    Anomaly detection jobs as detection alerts
  Summary

Section 3 – Data Frame Analysis

Chapter 9, Introducing Data Frame Analytics
  Technical requirements
  Learning how to use transforms
    Why are transforms useful?
    The anatomy of a transform
    Using transforms to analyze e-commerce orders
    Exploring more advanced pivot and aggregation configurations
    Discovering the difference between batch and continuous transforms
    Analyzing social media feeds using continuous transforms
  Using Painless for advanced transform configurations
    Introducing Painless
  Working with Python and Elasticsearch
    A brief tour of the Python Elasticsearch clients
  Summary
  Further reading

Chapter 10, Outlier Detection
  Technical requirements
  Discovering the four techniques used for outlier detection
  Understanding feature influence
  How does outlier detection differ from anomaly detection?
  Applying outlier detection in practice
  Evaluating outlier detection with the Evaluate API
  Hyperparameter tuning for outlier detection
  Summary

Chapter 11, Classification Analysis
  Technical requirements
  Classification: from data to a trained model
    Feature engineering
    Evaluating the model
  Taking your first steps with classification
  Classification under the hood: gradient boosted decision trees
    Introduction to decision trees
    Gradient boosted decision trees
    Hyperparameters
  Interpreting results
  Summary
  Further reading

Chapter 12, Regression
  Technical requirements
  Using regression analysis to predict house prices
  Using decision trees for regression
  Summary
  Further reading

Chapter 13, Inference
  Technical requirements
  Examining, exporting, and importing your trained models with the Trained Models API
    A tour of the Trained Models API
    Exporting and importing trained models with the Trained Models API and Python
  Understanding inference processors and ingest pipelines
    Handling missing or corrupted data in ingest pipelines
    Using inference processor configuration options to gain more insight into your predictions
  Importing external models into Elasticsearch using eland
    Learning about supported external models in eland
    Training a scikit-learn DecisionTreeClassifier and importing it into Elasticsearch using eland
  Summary

Appendix: Anomaly Detection Tips
  Technical requirements
  Understanding influencers in split versus non-split jobs
  Using one-sided functions to your advantage
  Ignoring time periods
    Ignoring an unexpected window of time, after the fact
    Ignoring an upcoming (known) window of time
  Using custom rules and filters to your advantage
    Creating custom rules
    Benefiting from custom rules for a "top-down" alerting philosophy
  Anomaly detection job throughput considerations
  Avoiding the over-engineering of a use case
  Using anomaly detection on runtime fields
  Summary

Why subscribe?

Other Books You May Enjoy

Index
Preface Elastic Stack, previously known as the ELK Stack, is a log analysis solution that helps users ingest, process, and analyze search data effectively. With the addition of machine learning, a key commercial feature, the Elastic Stack makes this process even more efficient. This updated second edition of Machine Learning with the Elastic Stack provides a comprehensive overview of Elastic Stack's machine learning features for both time series data analysis as well as classification, regression, and outlier detection. The book starts by explaining machine learning concepts in an intuitive way. You'll then perform time series analysis on different types of data, such as log files, network flows, application metrics, and financial data. As you progress through the chapters, you'll deploy machine learning within the Elastic Stack for logging, security, and metrics. Finally, you'll discover how data frame analysis opens up a whole new set of use cases that machine learning can help you with. By the end of this Elastic Stack book, you'll have hands-on machine learning and Elastic Stack experience, along with the knowledge you need to incorporate machine learning into your distributed search and data analysis platform.
Who this book is for If you're a data professional looking to gain insights into Elasticsearch data without having to rely on a machine learning specialist or custom development, then this Elastic Stack machine learning book is for you. You'll also find this book useful if you want to integrate machine learning with your observability, security, and analytics applications. Working knowledge of the Elastic Stack is needed to get the most out of this book.
What this book covers Chapter 1, Machine Learning for IT, acts as an introductory and background primer on the historical challenges of manual data analysis in IT and security operations. This chapter also provides a comprehensive overview of the theory of operation of Elastic machine learning in order to get an intrinsic understanding of what is happening under the hood.

Chapter 2, Enabling and Operationalization, explains how to enable the machine learning capabilities in the Elastic Stack and details the theory of operation of the Elastic machine learning algorithms, along with a detailed explanation of the logistics of operating Elastic machine learning.

Chapter 3, Anomaly Detection, goes into detail regarding the unsupervised automated anomaly detection techniques that are at the heart of time series analysis.

Chapter 4, Forecasting, explains how Elastic machine learning's sophisticated time series models can be used for more than just anomaly detection. Forecasting capabilities enable users to extrapolate trends and behaviors into the future so as to assist with use cases such as capacity planning.

Chapter 5, Interpreting Results, explains how to fully understand the results of anomaly detection and forecasting and use them to your advantage in visualizations, dashboards, and infographics.

Chapter 6, Alerting on ML Analysis, explains the different techniques for integrating the proactive notification capability of Elastic alerting with the insights uncovered by machine learning in order to make anomaly detection even more actionable.

Chapter 7, AIOps and Root Cause Analysis, explains how leveraging Elastic machine learning to holistically inspect and analyze data from disparate data sources into correlated views gives the analyst a leg up compared with legacy approaches.

Chapter 8, Anomaly Detection in Other Elastic Stack Apps, explains how anomaly detection is leveraged by other apps within the Elastic Stack to bring added value to data analysis.

Chapter 9, Introducing Data Frame Analytics, covers the concepts of data frame analytics, how it is different from time series anomaly detection, and what tools are available to the user to load, prepare, transform, and analyze data with Elastic machine learning.

Chapter 10, Outlier Detection, covers the outlier detection capabilities of Elastic machine learning's data frame analytics.

Chapter 11, Classification Analysis, covers the classification analysis capabilities of Elastic machine learning's data frame analytics.
Chapter 12, Regression, covers the regression analysis capabilities of Elastic machine learning's data frame analytics.

Chapter 13, Inference, covers the usage of trained machine learning models for "inference" – to actually predict output values in an operationalized manner.

Appendix: Anomaly Detection Tips includes a variety of practical advice topics that didn't quite fit in other chapters. These useful tidbits will help you to get the most out of Elastic ML.
To get the most out of this book You will need a system with a good internet connection and an Elastic account.
Download the example code files You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801070034_ColorImages.pdf.
Conventions used There are a number of text conventions used throughout this book. Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The analysis can also be split along categorical fields by setting partition_field_name."
A block of code is set as follows:

18/05/2020 15:16:00 DB Not Updated [Master] Table
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

export DATABRICKS_AAD_TOKEN=
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Let's now click the View results button to investigate in detail what the anomaly detection job has found in the data."

Tips or important notes appear like this.
Get in touch Feedback from our readers is always welcome. General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected]. Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details. Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material. If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Reviews Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you! For more information about Packt, please visit packt.com.
Section 1 – Getting Started with Machine Learning with Elastic Stack This section provides an intuitive understanding of the way Elastic ML works – from the perspective of not only what the algorithms are doing but also the logistics of the operation of the software within the Elastic Stack. This section covers the following chapters:
• Chapter 1, Machine Learning for IT
• Chapter 2, Enabling and Operationalization
1
Machine Learning for IT A decade ago, the idea of using machine learning (ML)-based technology in IT operations or IT security seemed a little like science fiction. Today, however, it is one of the most common buzzwords used by software vendors. Clearly, there has been a major shift in both the perception of the need for the technology and the capabilities that the state-of-the-art implementations of the technology can bring to bear. This evolution is important to fully appreciate how Elastic ML came to be and what problems it was designed to solve. This chapter is dedicated to reviewing the history and concepts behind how Elastic ML works. It also discusses the different kinds of analysis that can be done and the kinds of use cases that can be solved. Specifically, we will cover the following topics:
• Overcoming the historical challenges in IT
• Dealing with the plethora of data
• The advent of automated anomaly detection
• Unsupervised versus supervised ML
• Using unsupervised ML for anomaly detection
• Applying supervised ML to data frame analytics
Overcoming the historical challenges in IT IT application support specialists and application architects have a demanding job with high expectations. Not only are they tasked with moving new and innovative projects into place for the business, but they also have to keep currently deployed applications up and running as smoothly as possible. Today's applications are significantly more complicated than ever before—they are highly componentized, distributed, and possibly virtualized/containerized. They could be developed using Agile, or by an outsourced team. Plus, they are most likely constantly changing. Some DevOps teams claim they can typically make more than 100 changes per day to a live production system. Trying to understand a modern application's health and behavior is like a mechanic trying to inspect an automobile while it is moving. IT security operations analysts have similar struggles in keeping up with day-to-day operations, but they obviously have a different focus of keeping the enterprise secure and mitigating emerging threats. Hackers, malware, and rogue insiders have become so ubiquitous and sophisticated that the prevailing wisdom is that it is no longer a question of whether an organization will be compromised—it's more of a question of when they will find out about it. Clearly, knowing about a compromise as early as possible (before too much damage is done) is preferable to learning about it for the first time from law enforcement or the evening news. So, how can they be helped? Is the crux of the problem that application experts and security analysts lack access to data to help them do their job effectively? Actually, in most cases, it is the exact opposite. Many IT organizations are drowning in data.
Dealing with the plethora of data IT departments have invested in monitoring tools for decades, and it is not uncommon to have a dozen or more tools actively collecting and archiving data that can be measured in terabytes, or even petabytes, per day. The data can range from rudimentary infrastructure- and network-level data to deep diagnostic data and/or system and application log files. Business-level key performance indicators (KPIs) could also be tracked, sometimes including data about the end user's experience. The sheer depth and breadth of the data available is, in some ways, more comprehensive than it has ever been. To detect emerging problems or threats hidden in that data, there have traditionally been several main approaches to distilling the data into informational insights:
• Filter/search: Some tools allow the user to define searches to help trim down the data into a more manageable set. While extremely useful, this capability is most often used in an ad hoc fashion once a problem is suspected. Even then, the success of using this approach usually hinges on the ability for the user to know what they are looking for and their level of experience—both with prior knowledge of living through similar past situations and expertise in the search technology itself.
• Visualizations: Dashboards, charts, and widgets are also extremely useful to help us understand what data has been doing and where it is trending. However, visualizations are passive and require being watched for meaningful deviations to be detected. Once the number of metrics being collected and plotted surpasses the number of eyeballs available to watch them (or even the screen real estate to display them), visual-only analysis becomes less and less useful.
• Thresholds/rules: To get around the requirement of having data be physically watched in order for it to be proactive, many tools allow the user to define rules or conditions that get triggered upon known conditions or known dependencies between items. However, it is unlikely that you can realistically define all appropriate operating ranges or model all of the actual dependencies in today's complex and distributed applications. Plus, the amount and velocity of changes in the application or environment could quickly render any static rule set useless. Analysts find themselves chasing down many false positive alerts, setting up a boy who cried wolf paradigm that leads to resentment of the tools generating the alerts and skepticism of the value that alerting could provide.
Ultimately, there needed to be a different approach—one that wasn't necessarily a complete repudiation of past techniques, but could bring a level of automation and empirical augmentation of the evaluation of data in a meaningful way. Let's face it, humans are imperfect—we have hidden biases and limitations of capacity for remembering information and we are easily distracted and fatigued. Algorithms, if used correctly, can easily make up for these shortcomings.
The advent of automated anomaly detection ML, while a very broad topic that encompasses everything from self-driving cars to game-winning computer programs, was a natural place to look for a solution. If you realize that most of the requirements of effective application monitoring or security threat hunting are merely variations on the theme of find me something that is different from normal, then the discipline of anomaly detection emerges as the natural place to begin using ML techniques to solve these problems for IT professionals.
The science of anomaly detection is certainly nothing new. Many very smart people have researched and employed a variety of algorithms and techniques for many years. However, the practical application of anomaly detection for IT data poses some interesting constraints that make the otherwise academically worthy algorithms inappropriate for the job. These include the following:
• Timeliness: Notification of an outage, breach, or other significant anomalous situation should be known as quickly as possible to mitigate it. The cost of downtime or the risk of a continued security compromise is minimized if remedied or contained quickly. Algorithms that cannot keep up with the real-time nature of today's IT data have limited value.
• Scalability: As mentioned earlier, the volume, velocity, and variation of IT data continue to explode in modern IT environments. Algorithms that inspect this vast data must be able to scale linearly with the data to be usable in a practical sense.
• Efficiency: IT budgets are often highly scrutinized for wasteful spending, and many organizations are constantly being asked to do more with less. Tacking on an additional fleet of super-computers to run algorithms is not practical. Rather, modest commodity hardware with typical specifications must be able to be employed as part of the solution.
• Generalizability: While highly specialized data science is often the best way to solve a specific information problem, the diversity of data in IT environments drives a need for something that can be broadly applicable across most use cases. Reusability of the same techniques is much more cost-effective in the long run.
• Adaptability: Ever-changing IT environments will quickly render a brittle algorithm useless in no time. Training and retraining the ML model would only introduce yet another time-wasting venture that cannot be afforded.
• Accuracy: We already know that alert fatigue from legacy threshold and rule-based systems is a real problem. Swapping one false alarm generator for another will not impress anyone.
• Ease of use: Even if all of the previously mentioned constraints could be satisfied, any solution that requires an army of data scientists to implement it would be too costly and would be disqualified immediately.
So, now we are getting to the real meat of the challenge—creating a fast, scalable, accurate, low-cost anomaly detection solution that everyone will use and love because it works flawlessly. No problem!
As daunting as that sounds, Prelert founder and CTO Steve Dodson took on that challenge back in 2010. While Dodson certainly brought his academic chops to the table, the technology that would eventually become Elastic ML had its genesis in the throes of trying to solve real IT application problems—the first being a pesky intermittent outage in a trading platform at a major London finance company. Dodson, and a handful of engineers who joined the venture, helped the bank's team use the anomaly detection technology to automatically surface only the needles in the haystacks that allowed the analysts to focus on the small set of relevant metrics and log messages that were going awry. The identification of the root cause (a failing service whose recovery caused a cascade of subsequent network problems that wreaked havoc) ultimately brought stability to the application and prevented the need for the bank to spend lots of money on the previous solution, which was an unplanned, costly network upgrade. As time passed, however, it became clear that even that initial success was only the beginning. A few years and a few thousand real-world use cases later, the marriage of Prelert and Elastic was a natural one—a combination of a platform making big data easily accessible and technology that helped overcome the limitations of human analysis. Fast forward to 2021, a full 5 years after the joining of forces, and Elastic ML has come a long way in the maturation and expansion of capabilities of the ML platform. This second edition of the book encapsulates the updates made to Elastic ML over the years, including the introduction of integrations into several of the Elastic solutions around observability and security. This second edition also includes the introduction of "data frame analytics," which is discussed extensively in the third part of the book. In order to get a grounded, innate understanding of how Elastic ML works, we first need to get to grips with some terminology and concepts to understand things further.
Unsupervised versus supervised ML While there are many subtypes of ML, two very prominent ones (and the two that are relevant to Elastic ML) are unsupervised and supervised. In unsupervised ML, there is no outside guidance or direction from humans. In other words, the algorithms must learn (and model) the patterns of the data purely on their own. In general, the biggest challenge here is to have the algorithms accurately surface detected deviations of the input data's normal patterns to provide meaningful insight for the user. If the algorithm is not able to do this, then it is not useful and is unsuitable for use. Therefore, the algorithms must be quite robust and able to account for all of the intricacies of the way that the input data is likely to behave.
In supervised ML, input data (often multivariate data) is used to help model the desired outcome. The key difference from unsupervised ML is that the human decides, a priori, what variables to use as the input and also provides "ground-truth" examples of the expected target variable. Algorithms then assess how the input variables interact and influence the known output target. To accurately get the desired output (a prediction, for example), the algorithm must be given "the right kind of data": not only data that genuinely expresses the situation, but also enough diversity in the input data to effectively learn the relationship between the input data and the output target. As such, both cases require good input data, good algorithmic approaches, and a good mechanism to allow the ML to both learn the behavior of the data and apply that learning to assess subsequent observations of that data. Let's dig a little deeper into the specifics of how Elastic ML leverages unsupervised and supervised learning.
Using unsupervised ML for anomaly detection To get a more intuitive understanding of how Elastic ML's anomaly detection works using unsupervised ML, we will discuss the following:
• A rigorous definition of unusual with respect to the technology
• An intuitive example of learning in an unsupervised manner
• A description of how the technology models, de-trends, and scores the data
Defining unusual Anomaly detection is something almost all of us have a basic intuition about. Humans are quite good at pattern recognition, so it should be of no surprise that if I asked 100 people on the street what's unusual in the following graph, a vast majority (including non-technical people) would identify the spike in the green line:
Figure 1.1 – A line graph showing an anomaly
Similarly, let's say we ask what's unusual in the following photo:
Figure 1.2 – A photograph showing a seal among penguins
We will, again, likely get a majority that rightly claims that the seal is the unusual thing. But people may struggle to articulate in salient terms the actual heuristics that are used in coming to those conclusions. There are two different heuristics that we could use to define the different kinds of anomalies shown in these images:
• Something is unusual if its behavior has significantly deviated from an established pattern or range based upon its past history.
• Something is unusual if some characteristic of that entity is significantly different from the same characteristic of the other members of a set or population.
These key definitions will be relevant to Elastic ML's anomaly detection, as they form the two main fundamental modes of operation of the anomaly detection algorithms (temporal versus population analysis, as will be explored in Chapter 3, Anomaly Detection). As we will see, the user will have control over what mode of operation is employed for a particular use case.
Learning what's normal As we've stated, Elastic ML's anomaly detection uses unsupervised learning in that the learning occurs without anything being taught. There is no human assistance to shape the decisions of the learning; it simply does so on its own, via inspection of the data it is presented with. This is slightly analogous to the learning of a language via the process of immersion, as opposed to sitting down with books of vocabulary and rules of grammar. To go from a completely naive state where nothing is known about a situation to one where predictions could be made with good certainty, a model of the situation needs to be constructed. How this model is created is extremely important, as the efficacy of all subsequent actions taken based upon this model will be highly dependent on the model's accuracy. The model will need to be flexible and continuously updated based upon new information, because that is all that it has to go on in this unsupervised paradigm.
Probability models Probability distributions can serve this purpose quite well. There are many fundamental types of distributions (and Elastic ML uses a variety of distribution types, such as Poisson, Gaussian, log-normal, or even mixtures of models), but the Poisson distribution is a good one to discuss first, because it is appropriate in situations where there are discrete occurrences (the "counts") of things with respect to time:
Figure 1.3 – A graph demonstrating Poisson distributions (source: https://en.wikipedia.org/wiki/Poisson_distribution#/media/File:Poisson_pmf.svg)
There are three different variants of the distribution shown here, each with a different mean (λ) and, therefore, a different most likely value of k. We can make an analogy that says that these distributions model the expected amount of postal mail that a person gets delivered to their home on a daily basis, represented by k on the x axis:
• For λ = 1, there is about a 37% chance each that zero pieces or exactly one piece of mail is delivered daily. Perhaps this is appropriate for a college student that doesn't receive much postal mail.
• For λ = 4, there is about a 20% chance each that three or four pieces are received. This might be a good model for a young professional.
• For λ = 10, there is about a 13% chance that 10 pieces are received per day—perhaps representing a larger family or a household that has somehow found themselves on many mailing lists!
The discrete points on each curve also give the likelihood (probability) of other values of k. As such, the model can be informative and answer questions such as "Is getting 15 pieces of mail likely?" As we can see, it is not likely for a student (λ = 1) or a young professional (λ = 4), but it is somewhat likely for a large family (λ = 10). Obviously, there was a simple declaration made here that the models shown were appropriate for the certain people described—but it should seem obvious that there needs to be a mechanism to learn that model for each individual situation, not just assert it. The process for learning it is intuitive.
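To make the mail analogy concrete, the short sketch below uses SciPy's generic Poisson distribution to reproduce these probabilities, including the answer to the "15 pieces of mail" question. It is purely illustrative and is not Elastic ML's internal implementation; the lambda values are simply the ones from the analogy.

# Illustrative only: Poisson probabilities for the postal mail analogy.
# This uses SciPy's generic Poisson distribution, not Elastic ML's internal models.
from scipy.stats import poisson

for lam in (1, 4, 10):
    # Probability of receiving exactly k pieces of mail on a given day
    print(f"lambda={lam}: P(k=0)={poisson.pmf(0, lam):.3f}, "
          f"P(k={lam})={poisson.pmf(lam, lam):.3f}, "
          f"P(k=15)={poisson.pmf(15, lam):.6f}")

For λ = 10, the chance of exactly 15 pieces comes out to roughly 3.5%, which matches the intuition that such a day is uncommon but plausible for a large family.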
Learning the models Sticking with the postal mail analogy, it would be instinctive to realize that a method of determining what model is the best fit for a particular household could be ascertained simply by hanging out by the mailbox every day and recording what the postal carrier drops into the mailbox. It should also seem obvious that the more observations made, the higher your confidence should be that your model is accurate. In other words, only spending 3 days by the mailbox would provide less complete information and confidence than spending 30 days, or 300 for that matter. Algorithmically, a similar process could be designed to self-select the appropriate model based upon observations. Careful scrutiny of the algorithm's choices of the model type itself (that is, Poisson, Gaussian, log-normal, and so on) and the specific coefficients of that model type (as in the preceding example of λ) would also need to be part of this self-selection process. To do this, constant evaluation of the appropriateness of the model is done. Bayesian techniques are also employed to assess the model's likely parameter values, given the dataset as a whole, but allowing for tempering of those decisions based upon how much information has been seen prior to a particular point in time. The ML algorithms accomplish this automatically. Note For those that want a deeper dive into some of the representative mathematics going on behind the scenes, please refer to the academic paper at http://www.ijmlc.org/papers/398-LC018.pdf.
Most importantly, the modeling that is done is continuous, so that new information is considered along with the old, with an exponential weighting given to information that is fresher. Such a model, after 60 observations, could resemble the following:
Figure 1.4 – Sample model after 60 observations
It will then seem very different after 400 observations, as the data presents itself with a slew of new observations with values between 5 and 10:
Figure 1.5 – Sample model after 400 observations
Also, notice that there is the potential for the model to have multiple modes or areas/ clusters of higher probability. The complexity and trueness of the fit of the learned model (shown as the blue curve) with the theoretically ideal model (in black) matters greatly. The more accurate the model, the better representation of the state of normal for that dataset and thus, ultimately, the more accurate the prediction of how future values comport with this model. The continuous nature of the modeling also drives the requirement that this model is capable of serialization to long-term storage, so that if model creation/analysis is paused, it can be reinstated and resumed at a later time. As we will see, the operationalization of this process of model creation, storage, and utilization is a complex orchestration, which is fortunately handled automatically by Elastic ML.
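The "fresher data counts more" aspect of this continuous learning can be illustrated with a simple exponentially weighted running estimate. Elastic ML's actual Bayesian model updating is far more sophisticated, so treat the following only as a sketch of the weighting idea, with made-up counts.

# Illustrative only: an exponentially weighted running estimate of a rate,
# where newer observations carry more weight than older ones.
def update_rate(current_estimate, new_observation, alpha=0.1):
    # Higher alpha means older observations are forgotten more quickly
    return (1 - alpha) * current_estimate + alpha * new_observation

observations = [1, 0, 2, 1, 1, 3, 2, 8, 7, 9, 8, 10]  # counts per time interval
estimate = observations[0]
for obs in observations[1:]:
    estimate = update_rate(estimate, obs)
print(f"Current rate estimate: {estimate:.2f}")  # drifts toward the newer, higher values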
De-trending Another important aspect of faithfully modeling real-world data is to account for prominent overtone trends and patterns that naturally occur. Does the data ebb and flow hourly and/or daily with more activity during business hours or business days? If so, this needs to be accounted for. Elastic ML automatically hunts for prominent trends in the data (linear growth, cyclical harmonics, and so on) and factors them out. Let's observe the following graph:
Figure 1.6 – Periodicity detection in action
Here, the periodic daily cycle is learned, then factored out. The model's prediction boundaries (represented in the light-blue envelope around the dark-blue signal) dramatically adjust after automatically detecting three successive iterations of that cycle.
Therefore, as more data is observed over time, the models gain accuracy both from the perspective of the probability distribution function getting more mature, as well as via the auto-recognizing and de-trending of other routine patterns (such as business days, weekends, and so on) that might not emerge for days or weeks. In the following example, several trends are discovered over time, including daily, weekly, and an overall linear slope:
Figure 1.7 – Multiple trends being detected
These model changes are recorded as system annotations. Annotations, as a general concept, will be covered in later chapters.
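To make the idea of factoring out a periodic pattern concrete, here is a rough, self-contained sketch that learns an hourly profile from synthetic data and subtracts it. Elastic ML detects and removes such periodicity automatically and far more robustly; this is only an illustration of the concept.

# Illustrative only: removing a learned daily (24-hour) cycle from an hourly metric.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                            # two weeks of hourly samples
daily_cycle = 50 + 30 * np.sin(2 * np.pi * hours / 24)
series = daily_cycle + rng.normal(0, 3, hours.size)   # metric with a daily pattern plus noise

hour_of_day = hours % 24
profile = np.array([series[hour_of_day == h].mean() for h in range(24)])  # learned daily profile
residual = series - profile[hour_of_day]              # what remains after de-trending

print(f"std before: {series.std():.1f}, std after de-trending: {residual.std():.1f}")

Once the daily cycle has been factored out, only the residual variation needs to be judged for unusualness, which is why the prediction boundaries in the preceding figures tighten so dramatically after the periodic components are learned.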
Scoring of unusualness Once a model has been constructed, the likelihood of any future observed value can be found within the probability distribution. Earlier, we had asked the question "Is getting 15 pieces of mail likely?" This question can now be empirically answered, depending on the model, with a number between 0 (no possibility) and 1 (absolute certainty). Elastic ML will use the model to calculate this fractional value down to extremely small magnitudes (roughly as small as 1e-300), which can be helpful when dealing with very low probabilities. Let's observe the following graph:
Figure 1.8 – Anomaly scoring
Here, the probability of the observation of the actual value of 921 is now calculated to be 1.444e-9 (or, more commonly, a mere 0.0000001444% chance). This very small value is perhaps not that intuitive to most people. As such, ML will take this probability calculation, and via the process of quantile normalization, re-cast that observation on a severity scale between 0 and 100, where 100 is the highest level of unusualness possible for that particular dataset. In the preceding case, the probability calculation of 1.444e-9 is normalized to a score of 94. This normalized score will come in handy later as a means by which to assess the severity of the anomaly for the purposes of alerting and/or triage.
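The exact quantile normalization used by Elastic ML is internal to the product, but the flavor of the idea can be sketched as follows: the more extreme a new record's probability is relative to the probabilities seen so far, the closer its severity gets to 100. The function and the historical values below are entirely made up for illustration.

# Illustrative only: re-casting a record's probability onto a 0-100 severity scale
# by ranking it against previously seen probabilities. This is NOT Elastic ML's
# actual normalization algorithm, just a toy version of the idea.
import math

def severity(new_probability, historical_probabilities):
    # Work in -log10(p) space so that tiny probabilities become large "surprise" values
    surprise = -math.log10(new_probability)
    history = sorted(-math.log10(p) for p in historical_probabilities)
    rank = sum(1 for h in history if h <= surprise)
    return round(100 * rank / (len(history) + 1))

history = [0.5, 0.3, 0.2, 0.1, 0.05, 0.01, 0.001, 1e-5, 1e-7]
print(severity(1.444e-9, history))  # a very low probability maps to a high score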
The element of time In Elastic ML, all of the anomaly detection that we will discuss throughout the rest of the book will have an intrinsic element of time associated with the data and analysis. In other words, for anomaly detection, Elastic ML expects the data to be time series data and that data will be analyzed in increments of time. This is a key point and also helps discriminate between anomaly detection and data frame analytics in addition to the unsupervised/supervised paradigm. You will see that there's a slight nuance with respect to population analysis (covered in Chapter 3, Anomaly Detection) and outlier detection (covered in Chapter 10, Outlier Detection). While they effectively both find entities that are distinctly different from their peers, population analysis in anomaly detection does so with respect to time, whereas outlier detection analysis isn't constrained by time. More will become obvious as these topics are covered in depth in later chapters.
Applying supervised ML to data frame analytics With the exception of outlier detection (covered in Chapter 10, Outlier Detection), which is actually an unsupervised approach, the rest of data frame analytics uses a supervised approach. Specifically, there are two main types of problems that Elastic ML data frame analytics allows you to address:
• Regression: Used to predict a continuous numerical value (a price, a duration, a temperature, and so on)
• Classification: Used to predict whether something is of a certain class label (fraudulent transaction versus non-fraudulent, and more)
In both cases, models are built using training data to map input variables (which can be numerical or categorical) to output predictions by training decision trees. The particular implementation used by Elastic ML is a custom variant of XGBoost, an open source gradient-boosted decision tree framework that has recently gained some notoriety among data scientists for its ability to allow them to win Kaggle competitions.
The process of supervised learning The overall process of supervised ML is very different from the unsupervised approach. In the supervised approach, you distinctly separate the training stage from the predicting stage. A very simplified version of the process looks like the following:
Figure 1.9 – Supervised ML process
Here, we can see that in the training phase, features are extracted out of the raw training data to create a feature matrix (also called a data frame) to feed to the ML algorithm and create the model. The model can be validated against portions of the data to see how well it did, and subsequent refinement steps could be made to adjust which features are extracted, or to refine the parameters of the ML algorithm used to improve the accuracy of the model's predictions. Once the user decides that the model is efficacious, that model is "moved" to the prediction workflow, where it is used on new data. One at a time, a single new feature vector is inferenced against the model to form a prediction.
To get an intuitive sense of how this works, imagine a scenario in which you want to sell your house, but don't know what price to list it for. You research prior sales in your area and notice the price differentials for homes based on different factors (number of bedrooms, number of bathrooms, square footage, proximity to schools/shopping, age of home, and so on). Those factors are the "features" that are considered altogether (not individually) for every prior sale. This corpus of historical sales is your training data. It is helpful because you know for certain how much each property sold for (and that's the thing you'd ultimately like to predict for your house). If you study this enough, you might get an intuition about how the prices of houses are driven strongly by some features (for instance, the number of bedrooms) and that other features (perhaps the age of the home) may not affect the pricing much. This is a concept called "feature importance" that will be visited again in a later chapter. Armed with enough training data, you might have a good idea what the value of your home should be priced at, given that it is a three-bedroom, two-bath, 1,700-square-foot, 30-year-old home. In other words, you've constructed a model in your mind based on your research of comparable homes that have sold in the last year or so. If the past sales are the "training data," your home's specifications (bedrooms, bathrooms, and so on) are the feature vectors that will define the expected price, given your "model" that you've learned. Your simple mental model is obviously not as rigorous as one that could be constructed with regression analysis using ML using dozens of relevant input features, but this simple analogy hopefully cements the idea of the process that is followed in learning from prior, known situations, and then applying that knowledge to a present, novel situation.
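To ground the house-pricing analogy in runnable code, here is a minimal train-then-predict workflow using scikit-learn with invented numbers. Elastic ML's data frame analytics performs the equivalent steps (training on a feature matrix, then inferencing new feature vectors) inside the Elastic Stack with its own gradient-boosted tree implementation, so this is a sketch of the general process rather than of Elastic ML itself.

# Illustrative only: the generic supervised train-then-predict workflow.
from sklearn.ensemble import GradientBoostingRegressor

# Feature matrix: [bedrooms, bathrooms, square_feet, age_years] for prior sales
X_train = [
    [2, 1, 1100, 45],
    [3, 2, 1650, 28],
    [3, 2, 1800, 12],
    [4, 3, 2400, 5],
    [4, 2, 2100, 35],
    [5, 3, 3000, 8],
]
y_train = [210_000, 310_000, 355_000, 480_000, 390_000, 600_000]  # known sale prices

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Inference: predict a price for a 3-bed, 2-bath, 1,700 sq ft, 30-year-old home
print(model.predict([[3, 2, 1700, 30]]))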
Summary To summarize what we discussed in this chapter, we covered the genesis story of ML in IT—born out of the necessity to automate analysis of the massive, ever-expanding growth of collected data within enterprise environments. We also got a more intuitive understanding of the different types of ML in Elastic ML, which includes both unsupervised anomaly detection and supervised data frame analysis. As we journey through the rest of the chapters, we will often be mapping the use cases of the problems we're trying to solve to the different modes of operation of Elastic ML.
Remember that if the data is a time series, meaning that it comes into existence routinely over time (metric/performance data, log files, transactions, and so on), it is quite possible that Elastic ML's anomaly detection is all you'll ever need. As you'll see, it is incredibly flexible and easy to use and accomplishes many use cases on a broad variety of data. It's kind of a Swiss Army knife! A large amount of this book (Chapters 3 through 8) will be devoted to how to leverage anomaly detection (and the ancillary capability of forecasting) to get the most out of your time series data that is in the Elastic Stack. If you are more interested in finding unusual entities within a population/cohort (User/Entity Behavior), you might have a tricky decision between using population analysis in anomaly detection versus outlier detection in data frame analytics. The primary factor may be whether or not you need to do this in near real time—in which case you might likely choose population analysis. If near real time is not necessary and/or if you require the consideration of multiple features simultaneously, you would choose outlier detection. See Chapter 10, Outlier Detection, for more detailed information about the comparison and benefits of each approach. That leaves many other use cases that require a multivariate approach to modeling. This would not only align with the previous example of real estate pricing but also encompass the use cases of language detection, customer churn analysis, malware detection, and so on. These will fall squarely in the realm of the supervised ML of data frame analytics and be covered in Chapters 11 through 13. In the next chapter, we will get down and dirty with understanding how to enable Elastic ML and how it works in an operational sense. Buckle up and enjoy the ride!
2
Enabling and Operationalization We have just learned the basics of what Elastic ML is doing to accomplish both unsupervised automated anomaly detection and supervised data frame analysis. Now it is time to get detailed about how Elastic ML works inside the Elastic Stack (Elasticsearch and Kibana). This chapter will focus on both the installation (really, the enablement) of Elastic ML features and a detailed discussion of the logistics of the operation, especially with respect to anomaly detection. Specifically, we will cover the following topics:
• Enabling Elastic ML features
• Understanding operationalization
Technical requirements The information in this chapter will use the Elastic Stack as it exists in v7.10 and the workflow of the Elasticsearch Service of Elastic Cloud as of November 2020.
Enabling Elastic ML features The process for enabling Elastic ML features inside the Elastic Stack is slightly different if you are doing so within a self-managed cluster versus using the Elasticsearch Service (ESS) of Elastic Cloud. In short, on a self-managed cluster, the features of ML are enabled via a license key (either a commercial key or a trial key). In ESS, a dedicated ML node needs to be provisioned within the cluster in order to utilize Elastic ML. In the following sections, we will explain the details of how this is accomplished in both scenarios.
Enabling ML on a self-managed cluster If you have a self-managed cluster that was created from the downloading of Elastic's default distributions of Elasticsearch and Kibana (available at elastic.co/downloads/), enabling Elastic ML features via a license key is very simple. Be sure to not use the Apache 2.0 licensed open source distributions that do not contain the X-Pack code base. Elastic ML, unlike the bulk of the capabilities of the Elastic Stack, is not free – it requires a commercial (specifically, a Platinum level) license. It is, however, open source in that the source code is out in the open on GitHub (github.com/elastic/ml-cpp) and that users can look at the code, file issues, make comments, or even execute pull requests. However, the usage of Elastic ML is governed by a commercial agreement with Elastic, the company. When Elastic ML was first released (back in the v5.x days), it was part of the closed source features known as X-Pack that required a separate installation step. However, as of version 6.3, the code of X-Pack was "opened" (elastic.co/what-is/open-x-pack) and folded into the default distribution of Elasticsearch and Kibana. Therefore, a separate X-Pack installation step was no longer necessary, just the "enablement" of the features via a commercial license (or a trial license). The installation procedure for Elasticsearch and Kibana itself is beyond the scope of this book, but it is easily accomplished by following the online documentation on the Elastic website (available at elastic.co/guide/).
Once Elasticsearch and Kibana are running, navigate to the Stack option from the left-side navigation menu and select License Management. You will see a screen like the following:
Figure 2.1 – The License management screen in Kibana
Notice that, by default, the license level applied is the free Basic tier. This enables you to use some of the advanced features not found in the Apache 2.0 licensed open source distribution, or on third-party services (such as the Amazon Elasticsearch Service). A handy guide for comparing the features that exist at the different license levels can be found on the Elastic website at elastic.co/subscriptions.
As previously stated, Elastic ML requires a Platinum tier license. If you have purchased a Platinum license from Elastic, you can apply that license by clicking on the Update license button, as shown on the screen in Figure 2.1. If you do not have a Platinum license, you can start a free 30-day trial by clicking the Start my trial button to enable Elastic ML and the other Platinum features (assuming you agree to the license terms and conditions):
Figure 2.2 – Starting a free 30-day trial
Once this is complete, the licensing screen will indicate that you are now in an active trial of the Platinum features of the Elastic Stack:
Figure 2.3 – Trial license activated
Once this is done, you can start to use Elastic ML right away. Additional configuration steps are needed to take advantage of the other Platinum features, but those steps are outside the scope of this book. Consult the Elastic documentation for further assistance on configuring those features.
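If you prefer to script this step rather than click through Kibana, the trial can also be started through Elasticsearch's licensing API. The sketch below uses the official Python client; the endpoint URL and credentials are placeholders that you would replace with your own cluster's details.

# Hedged sketch: starting the 30-day trial license via the licensing API.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", http_auth=("elastic", "<your-password>"))

# Equivalent to: POST /_license/start_trial?acknowledge=true
response = es.license.post_start_trial(acknowledge=True)
print(response)          # expect the trial to be acknowledged and started
print(es.license.get())  # confirm the active license type is now "trial"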
Enabling ML in the cloud – Elasticsearch Service If downloading, installing, and self-managing the Elastic Stack is less interesting than just getting the Elastic Stack platform offered as a service, then head on over to Elastic Cloud (cloud.elastic.co) and sign up for a free trial, using only your email:
Figure 2.4 – Elastic Cloud welcome screen
You can then perform the following steps: 1. Once inside the Elastic Cloud interface after logging in, you will have the ability to start a free trial by clicking the Start your free trial button:
Figure 2.5 – Elastic Cloud home screen
Once the button is clicked, you will see that your 14-day free trial of ESS has started:
Figure 2.6 – Elasticsearch Service trial enabled
2. Of course, in order to try out Elastic ML, you first need an Elastic Stack cluster provisioned. There are a few options to create what ESS refers to as deployments, with some that are tailored to specific use cases. For this example, we will use the Elastic Stack template on the left of Figure 2.6 and choose the I/O Optimized hardware profile, but feel free to experiment with the other options during your trial:
Figure 2.7 – Creating an ESS deployment
3. You can also choose which cloud provider and region to start your cluster in but, most importantly, if you want to use ML features, you must enable an ML node by first clicking on the Customize button near the bottom-right corner.
4. After clicking the Customize button, you will see a new screen that allows you to add an ML node:
Figure 2.8 – Customizing deployment to add an ML node
5. Near the bottom of Figure 2.8 is a link to Add Machine Learning nodes to your cluster. Clicking on this will reveal the ML node configuration:
Figure 2.9 – Adding ML node(s)
Note During the free 14-day trial period of ESS, you can only add one 1 GB ML node (in one or two availability zones). If you move from a free trial to a paid subscription, you can obviously create more or larger ML nodes.
6. Once the ML node is added to the configuration, click on the Create Deployment button to initiate the process for ESS to create your cluster for you, which will take a few minutes. In the meantime, you will be shown the default credentials that you will use to access the cluster:
Figure 2.10 – Default assigned credentials
You can download these credentials for use later. Don't worry if you forget to download them – you can always reset the password later if needed.
7. Once the cluster is up and running as shown in Figure 2.11 (usually only after a few minutes), you will see the following view of your deployment, with an Open Kibana button that will allow you to launch into your deployment:
Figure 2.11 – Deployment successfully created
Once the Open Kibana button is clicked, you will be automatically authenticated into Kibana, where you will be ready to use ML straight away – no additional configuration steps are necessary.
At this point, from the perspective of the user who wants to use Elastic ML, there is little difference between the self-managed configuration shown earlier and the setup created in ESS. The one major difference, however, is that the configuration here in ESS has Elastic ML always isolated to a dedicated ML node. In a self-managed configuration, ML nodes can be dedicated or in a shared role (such as data, ingest, and ml roles all on the same node). We will discuss this concept later in this chapter. Now that we have a functioning Elastic Stack with ML enabled, we are getting closer to being able to start analyzing data, which will begin in Chapter 3, Anomaly Detection. But first, let's understand the operationalization of Elastic ML.
Understanding operationalization

At some point on your journey with using Elastic ML, it will be helpful to understand a number of key concepts regarding how Elastic ML is operationalized within the Elastic Stack. This includes information about how the analytics run on the cluster nodes and how data that is to be analyzed by ML is retrieved and processed.

Note
Some concepts in this section may not be intuitive until you actually start using Elastic ML on some real examples. Don't worry if you feel like you prefer to skim (or even skip) this section now and return to it later following some genuine experience of using Elastic ML.
ML nodes

First and foremost, since Elasticsearch is, by nature, a distributed multi-node solution, it is only natural that the ML feature of the Elastic Stack works as a native plugin that obeys many of the same operational concepts. As described in the documentation (elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html), ML can be enabled on any or all nodes, but it is a best practice in a production system to have dedicated ML nodes. We saw this best practice forced on the user in Elastic Cloud ESS – the user must create dedicated ML nodes if ML is desired to be used.

Having dedicated ML nodes is also helpful in optimizing the types of resources specifically required by ML. Unlike data nodes that are involved in a fair amount of disk I/O loads due to indexing and searching, ML nodes are more compute- and memory-intensive. With this knowledge, you can size the hardware appropriately for dedicated ML nodes.
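As a rough sketch of what a dedicated ML node looks like in a self-managed cluster, the node's elasticsearch.yml might be configured as follows. This assumes version 7.9 or later, where the node.roles setting replaced the older individual role flags; the remote_cluster_client role is optional and only matters if your datafeeds use cross-cluster search:

# elasticsearch.yml for a dedicated ML node (no data, master, or ingest roles)
node.roles: [ ml, remote_cluster_client ]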
One key thing to note: the ML algorithms do not run in the Java Virtual Machine (JVM). They are C++-based executables that will use the RAM that is left over from whatever is allocated for the JVM heap. When running ML jobs, a process that invokes the analysis (called autodetect for anomaly detection and data_frame_analyzer for data frame analytics) can be seen in the process list (if you were to run the ps command on Linux, for example). There will be one process for every actively running ML job. In multi-node setups, ML will distribute the jobs to each of the ML-enabled nodes to balance the load of the work.

Elastic ML obeys a setting called xpack.ml.max_machine_memory_percent, which governs how much system memory can be used by ML jobs (an example of adjusting it follows at the end of this section). The default value of this setting is 30%. The limit is based on the total memory of the machine, not memory that is currently free. Don't forget that the Elasticsearch JVM may take up to around 50% of the available machine memory, so leaving 30% to ML and the remaining 20% for the operating system and other ancillary processes is prudent, albeit conservative. Jobs are not allocated to a node if doing so would cause the estimated memory use of ML jobs to exceed the limit defined by this setting.

While there is no empirical formula to determine the size and number of dedicated ML nodes, some good rules of thumb are as follows:

• Have one dedicated ML node (two for high availability/fault tolerance if a single node becomes unavailable) for cluster sizes of up to 10 data nodes.
• Have at least two ML nodes for clusters of up to 20 nodes.
• Add an additional ML node for every additional 10 data nodes.

This general approach of reserving about 10-20% of your cluster capacity for dedicated ML nodes is certainly a reasonable suggestion, but it does not obviate the need to do your own sizing, characterization testing, and resource monitoring. As we will see in several later chapters, the resource demands of your ML tasks will greatly depend on what kind(s) of analyses are being invoked, as well as the density and volume of the data being analyzed.
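If you do need to adjust the memory headroom given to ML on your self-managed nodes, the setting can be changed with the cluster settings API. This is a sketch only – raising the value to 40% here assumes you have verified that the JVM heap and operating system overhead on your ML nodes leave that much memory genuinely available, and that your version exposes the setting as a dynamic cluster setting:

PUT _cluster/settings
{
  "persistent": {
    "xpack.ml.max_machine_memory_percent": 40
  }
}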
Jobs

In Elastic ML, the job is the unit of work. There are both anomaly detection jobs and data frame analytics jobs. Both take some kind of data as input and produce new information as output. Jobs can be created using the ML UI in Kibana, or programmatically via the API. They also require ML-enabled nodes. In general, anomaly detection jobs can be run as a single-shot batch analysis (over a swath of historical data) or continuously run in real time on time series data – data that is constantly being indexed by your Elastic Stack (or both, really).
Data frame analytics jobs, in contrast, are not continuous – they are single-shot executions that produce output results and/or an output model that is used for subsequent inferencing, discussed in more depth in Chapters 9 to 13. From an operationalization standpoint, anomaly detection jobs are therefore a bit more complex – many of them can run simultaneously, doing independent things and analyzing data from different indices. In other words, anomaly detection jobs are likely to be continuously busy within a typical cluster.

As we will see in more depth later, the main configuration elements of an anomaly detection job are as follows (a minimal API sketch of these elements appears after this list):

• Job name/ID
• Analysis bucketization window (the bucket span)
• The definition and settings for the query to obtain the raw data to be analyzed (the datafeed)
• The anomaly detection configuration recipe (the detector)

With the notion of jobs understood, we'll next focus on how the bucketing of time series data is an important concept in the analysis of real-time data.
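To make those configuration elements concrete, here is a minimal sketch of creating an anomaly detection job and its datafeed through the API. The job name, index, and detector shown here are purely illustrative:

# A hypothetical job that counts events in 15-minute buckets
PUT _ml/anomaly_detectors/web_request_rate
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "timestamp" }
}

# The datafeed that defines which raw documents are fed to the job
PUT _ml/datafeeds/datafeed-web_request_rate
{
  "job_id": "web_request_rate",
  "indices": [ "kibana_sample_data_logs" ],
  "query": { "match_all": {} }
}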
Bucketing data in a time series analysis

Bucketing input data is an important concept to understand in Elastic ML's anomaly detection. Set with a key parameter at the job level called bucket_span, the input data from the datafeed (described next) is collected into mini batches for processing. Think of the bucket span as a pre-analysis aggregation interval – the window of time over which a portion of the data is aggregated for the purposes of analysis. The shorter the duration of bucket_span, the more granular the analysis, but also the higher the potential for noisy artifacts in the data. To illustrate, the following graph shows the same dataset aggregated over three different intervals:
Figure 2.12 – Aggregations of the same data over different time intervals
Notice that the prominent anomalous spike seen in the version aggregated over the 5-minute interval becomes all but lost if the data is aggregated over a 60-minute interval, due to the spike's short duration relative to that longer bucket.

Clicking the > icon next to the timestamp for the document in Figure 5.16 will expand it so that you can see all of the details:
Figure 5.17 – Bucket-level document detail in Kibana Discover
You can see that just one bucket-level document was returned from our query in Figure 5.16, a single anomalous time bucket (at timestamp 1613824200000, or in my time zone, February 20, 2021, 07:30:00 A.M. GMT-05:00) that has an anomaly_score greater than 98. In other words, there were no other time buckets with anomalies that big in this time range. Let's look at the key fields:

• timestamp: The timestamp of the leading edge of the time bucket. In Kibana, this field will be displayed by default in your local time zone (although it is stored in the index in epoch format with the time zone of UTC).
• anomaly_score: The current normalized score of the bucket, based upon the range of probabilities seen over the entirety of the job. The value of this score may fluctuate over time as new data is processed by the job and new anomalies are found.
• initial_anomaly_score: The normalized score of the bucket, that is, when that bucket was first analyzed by the analytics. This score, unlike anomaly_score, will not change as more data is analyzed.
• event_count: The number of raw Elasticsearch documents seen by the ML algorithms during the bucket's span.
• is_interim: A flag that signifies whether the bucket is finalized or whether it is still waiting for all of the data within the bucket span to be received. This field is relevant for ongoing jobs that are operating in real time. For certain types of analysis, there could be interim results, even though not all of the data for the bucket has been seen.
• job_id: The name of the anomaly detection job that created this result.
• processing_time_ms: An internal performance measurement of how much processing time (in milliseconds) the analytics took to process this bucket's worth of data.
• bucket_influencers: An array of influencers (and details on them) that have been identified for this current bucket. Even if no influencers have been chosen as part of the job configuration, or there are no influencers as part of the analysis, there will always be a default influencer of the influencer_field_name:bucket_time type, which is mostly an internal record-keeping device to allow for the ordering of bucket-level anomalies in cases where explicit influencers cannot be determined. If a job does have named and identified influencers, then the bucket_influencers array may look like what is shown in Figure 5.17.
Notice that in addition to the default entry of the influencer_field_name:bucket_time type, in this case, there is an entry for a field name of an analytics-identified influencer for the geo.src field. This is a cue that geo.src was a relevant influencer type that was discovered at the time of this anomaly. Since multiple influencer candidates can be chosen in the job configuration, it should be noted that in this case, geo.src is the only influencer field and no other fields were found to be influential. It should also be noted that, at this level of detail, the particular instance of geo.src (that is, which one) is not disclosed; that information will be disclosed when querying at the lower levels of abstraction, which we will discuss next.
Record results

At a lower level of abstraction, there are results at the record level. Giving the most amount of detail, record results show specific instances of anomalies and essentially answer the question, "What entity was unusual and by how much?" Let's look at an example document in the .ml-anomalies-* index by using Kibana Discover and issuing the following KQL query:

result_type:"record" and record_score > 98
This will return something similar to the following:
Figure 5.18 – Record-level result document as seen in Kibana Discover
Clicking on the > icon next to the timestamp for the document will expand it so that you are able to see all of the details:
Figure 5.19 – Record-level document detail in Kibana Discover
You can see that a few record-level documents were returned from our query in Figure 5.18. Let's look at the key fields:

• timestamp: The timestamp of the leading edge of the time bucket, inside which this anomaly occurred. This is similar to what was explained earlier.
• job_id: The name of the anomaly detection job that created this result.
• record_score: The current normalized score of the anomaly record, based upon the range of the probabilities seen over the entirety of the job. The value of this score may fluctuate over time as new data is processed by the job and new anomalies are found.
• initial_record_score: The normalized score of the anomaly record, that is, when that bucket was first analyzed by the analytics. This score, unlike record_score, will not change as more data is analyzed.
• detector_index: An internal counter to keep track of the detector configuration that this anomaly belongs to. Obviously, with a single-detector job, this value will be zero, but it may be non-zero in jobs with multiple detectors.
• function: A reference to keep track of which detector function was used for the creation of this anomaly.
• is_interim: A flag that signifies whether the bucket is finalized or whether it is still waiting for all of the data within the bucket span to be received. This field is relevant for ongoing jobs that are operating in real time. For certain types of analysis, there could be interim results, even though not all of the data for the bucket has been seen.
• actual: The actual observed value of the analyzed data in this bucket. For example, if the function is count, then this represents the number of documents that are encountered (and counted) in this time bucket.
• typical: A representation of the expected or predicted value based upon the ML model for this dataset.
• multi_bucket_impact: A measurement (on a scale from -5 to +5) that determines how much this particular anomaly was influenced by the secondary multi-bucket analysis (explained later in the chapter), from no influence (-5) to all influence (+5).
• influencers: An array of which influencers (and the values of those influencers) are relevant to this anomaly record.

If a job has splits defined (either with by_field_name and/or partition_field_name) and identified influencers, then the record results documents will have more information, such as what is seen in Figure 5.19:

• partition_field_name: A cue that a partition field was defined and that an anomaly was found for one of the partition field values.
• partition_field_value: The value of the partition field that this anomaly occurred for. In other words, the entity name this anomaly was found for.
In addition to the fields mentioned here (which would have been by_field_name and by_field_value if the job had been configured to use a by field), we also see an explicit instance of the geo.src field. This is just a shortcut – every partition, by, or over field value in the results will also have a direct field name. If your job is doing population analysis (via the use of over_field_name), then the record results document will be organized slightly differently, as the reporting is oriented around the unusual members of the population. For example, if we look at a population analysis job on the kibana_sample_data_logs index in which we choose distinct_count("url.keyword") over clientip as the detector, then an example record-level results document will also contain a causes array:
Figure 5.20 – Record-level document showing the causes array for a population job
The causes array is built to compactly express all of the anomalous things that that IP did in that bucket. Again, many things seem redundant, but it is primarily because there may be different ways of aggregating the information for results presentation in dashboards or alerts. Also, in the case of this population analysis, we see that the influencers array contains both the clientip field and the response.keyword field:
Figure 5.21 – Record-level document showing the influencers array for a population job
Let's conclude our survey of the results index schema by looking at the influencer-level results.
Influencer results

Yet another lens by which to view the results is via influencers. Viewing the results this way allows us to answer the question, "What were the most unusual entities in my ML job and when were they unusual?" To understand the structure and content of influencer-level results, let's look at an example document in the .ml-anomalies-* index by using Kibana Discover and issuing the following KQL query:

result_type:"influencer" and response.keyword:404
Figure 5.22 – Influencer-level result document as seen in Kibana Discover
Notice that in this case, we didn't query on the score (influencer_score), but rather on an expected entity name and value. The last document listed (with an influencer_score of 50.174) matches what we saw back in Figure 5.13.
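If, instead, you want to surface the most anomalous influencers regardless of which entity they belong to, a score-oriented KQL query along the following lines should work (the threshold of 75 is just an illustrative choice):

result_type:"influencer" and influencer_score > 75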
Let's look at the key fields:

• timestamp: The timestamp of the leading edge of the time bucket, inside which this influencer's anomalous activity occurred. This is similar to what was explained earlier.
• job_id: The name of the anomaly detection job that created this result.
• influencer_field_name: The name of the field that was declared as an influencer in the job configuration.
• influencer_field_value: The value of the influencer field for which this result is relevant.
• influencer_score: The current normalized score of how unusual and contributory the influencer was to anomalies at this point.
• initial_influencer_score: The normalized score of the influencer when that bucket was first analyzed by the analytics. This score, unlike influencer_score, will not change as more data is analyzed.
• is_interim: A flag that signifies whether the bucket is finalized or whether it is still waiting for all of the data within the bucket span to be received. This field is relevant for ongoing jobs that are operating in real time. For certain types of analysis, there could be interim results, even though not all of the data for the bucket has been seen.

Now that we have exhaustively explained the relevant fields that are available to the user, we can file that information away for when we build custom dashboards, visualizations, and sophisticated alerting in subsequent sections and chapters. But, before we exit this chapter, we still have a few important concepts to explore. Next up is a discussion on a special kind of anomaly – the multi-bucket anomaly.
Multi-bucket anomalies

Almost everything that we've studied so far with anomalies being generated by Elastic ML's anomaly detection jobs has been with respect to looking at a specific anomaly being raised at a specific time, but quantized at the interval of bucket_span. However, we can certainly have situations in which a particular observation within a bucket span may not be that unusual, but an extended window of time, taken collectively together, might be more significantly unusual than any single observation. Let's see an example.
Multi-bucket anomaly example

First shown in the example in Chapter 3, Anomaly Detection, in Figure 3.17, we repeat the figure here to show how multi-bucket anomalies exhibit themselves in the Elastic ML UI:
Figure 5.23 – Multi-bucket anomalies first shown in Chapter 3
As we discussed in Chapter 3, Anomaly Detection, multi-bucket anomalies are designated with a different symbol in the UI (a cross instead of a dot). They denote cases in which the actual singular value may not necessarily be anomalous, but a trend occurring across a sliding window of 12 consecutive buckets is. Here, you can see that there is a noticeable slump spanning several adjacent buckets.
Note, however, that some of the multi-bucket anomaly markers are placed on the data at times after the data has "recovered." This can be somewhat confusing to users until you realize that because the determination of multi-bucket anomalies is a secondary analysis (in addition to the bucket-by-bucket analysis) and because this analysis is a sliding window looking in arrears, the leading edge of that window, when the anomaly is recorded, might be after the situation has recovered.
Multi-bucket scoring

As mentioned, multi-bucket analysis is a secondary analysis. Therefore, two probabilities are calculated for each bucket span – the probability of the observation seen in the current bucket, and the probability of a multi-bucket feature – a kind of weighted average of the current bucket and the previous 11. If those two probabilities are roughly the same order of magnitude, then multi_bucket_impact will be low (on the negative side of the -5 to +5 scale). If, on the other hand, the multi-bucket feature probability is wildly lower (thus more unusual), then multi_bucket_impact will be high. In the example shown in Figure 5.23, the UI will show the user the multi-bucket impact as being high, but will not give you the actual scoring:
Figure 5.24 – Multi-bucket anomalies, with impact scoring shown
However, if you look at the raw record-level result, you will see that multi_bucket_impact has indeed been given a value of +5:
Figure 5.25 – Multi-bucket anomaly record, with the raw score shown
Multi-bucket anomalies give you a different perspective on the behavior of your data. You will want to keep in mind how they are signified and scored via the multi_bucket_impact field in order for you to include or exclude them, as required, from your reporting or alerting logic. Let's now look forward (yes, pun intended) to how results from forecasts are represented in the results index.
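Before moving on, here is a sketch of how that inclusion or exclusion might look in practice as KQL queries in Discover (the score thresholds are illustrative only). To focus on anomalies dominated by the multi-bucket analysis:

result_type:"record" and record_score > 75 and multi_bucket_impact >= 4

Or, to exclude multi-bucket-driven anomalies from consideration:

result_type:"record" and record_score > 75 and multi_bucket_impact < 0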
Forecast results

As explained in depth in Chapter 4, Forecasting, we can get Elastic ML to extrapolate into the future the trends of the data that has been analyzed. Recall what we showed in Figure 4.21:
Figure 5.26 – Forecast results first shown in Chapter 4
Remember that the prediction value is the value with the highest likelihood (probability), and that the shaded area is the range of the 95th percentile of confidence. These three key values are stored in the .ml-anomalies-* results indices with the following names:

• forecast_prediction
• forecast_upper
• forecast_lower
Querying for forecast results

When querying for the forecast results in the .ml-anomalies-* results indices, it is important to remember that forecast results are transient – they have a default lifespan of 14 days following creation, especially if they are created from the UI in Kibana. If a different expiration duration is desired, then the forecast will have to be invoked via the _forecast API endpoint, explicitly setting the expires_in duration. Another thing to remember is that multiple forecasts may have been invoked at different moments in time on the same dataset. As shown back in Figure 4.4 and repeated here, multiple forecast invocations produce multiple forecast results:
Figure 5.27 – A symbolic representation of invoking multiple forecasts at different times
As such, we need a way to discern between the results. In the Kibana UI, they are discernable simply by looking at the Created date:
Figure 5.28 – Viewing multiple previously run forecasts
However, when looking at the results index, it should be noted that each invoked forecast has a unique forecast_id:
Figure 5.29 – Viewing forecast results in Kibana Discover
This forecast_id is only obvious when invoking the forecast using the _forecast API because forecast_id is returned as part of the payload of the API call. Therefore, if multiple forecasts were created spanning a common time frame, there would be more than one result with different IDs. When querying the forecast results, you can think of two possible orientations: • Value-focused: The query supplies a date and time, and the result is a particular value for that time is returned. The question, "What is my utilization 5 days from now?" would be a good example. • Time-focused: The query supplies a value, and the result is a time at which that value is realized. The question, "When does my utilization reach 80%?" would be a good example.
Obviously, either type of query is possible. To satisfy the time-focused inquiry, for example, we need to re-orient the query a little to ask it to return the date (or dates) on which the predicted values meet certain criteria. The user can query for the forecast results using other traditional query methods (KQL, Elasticsearch DSL), but to mix it up a little, we'll submit the query using Elastic SQL in the Kibana Dev Tools Console:

POST _sql?format=txt
{
  "query": "SELECT forecast_prediction,timestamp FROM \".ml-anomalies-*\" WHERE job_id='forecast_example' AND forecast_id='Fm5EiHcBpc7Wt6MbaGcw' AND result_type='model_forecast' AND forecast_prediction>'16890' ORDER BY forecast_prediction DESC"
}
Here, we are asking whether there are any times during which the predicted value exceeds our limit of the value of 16,890. The response is as follows:

forecast_prediction| timestamp
-------------------+------------------------
16893.498325784924 |2017-03-17T09:45:00.000Z
In other words, we may breach the threshold on March 17 at 9:45 A.M. GMT (although remember from Chapter 4, Forecasting, that the sample data used is from the past and therefore forecast predictions are also in the past). Now that we have a good understanding of how to query for forecast results, we could include them in dashboards and visualizations, which we will cover later in this chapter – or even in alerts, as we'll see in Chapter 6, Alerting on ML Analysis. But, before we look at including results in custom dashboards and visualizations, let's cover one last brief topic – the Elastic ML results API.
Results API

If programmatic access to the results is your thing, in addition to querying the results indices directly, you could opt to instead query Elastic ML's results API. Some parts of the API are redundant to what we've already explored, and some parts are unique. We will now check them out in the upcoming sections.
Results API endpoints

There are five different results API endpoints available:

• Get buckets
• Get influencers
• Get records
• Get overall buckets
• Get categories

The first three API endpoints give results that are redundant in light of what we've already covered in this chapter by way of querying the results index directly (through Kibana or using the Elasticsearch _search API), and that method actually allows more flexibility, so we really won't bother discussing them here. However, the last two API endpoints are novel, and each deserves an explanation.
Getting the overall buckets API

The overall buckets API call is a means by which to return summarized results across multiple anomaly detection jobs in a programmatic way. We're not going to explore every argument of the request body, nor will we describe every field in the response body, as you can reference the documentation. But we will discuss here the important function of this API call, which is to request the results from an arbitrary number of jobs, and to receive a single result score (called overall_score) that encapsulates the top_n average of the maximum bucket anomaly_score for each job requested. As shown in the documentation, an example call is one that asks for the top two jobs (in the set of jobs that begin with the name job-) whose bucket anomaly score, when averaged together, is higher than 50.0, starting from a specific timestamp:

GET _ml/anomaly_detectors/job-*/results/overall_buckets
{
  "top_n": 2,
  "overall_score": 50.0,
  "start": "1403532000000"
}
This will result in the following sample return:

{
  "count": 1,
  "overall_buckets": [
    {
      "timestamp" : 1403532000000,
      "bucket_span" : 3600,
      "overall_score" : 55.0,
      "jobs" : [
        {
          "job_id" : "job-1",
          "max_anomaly_score" : 30.0
        },
        {
          "job_id" : "job-2",
          "max_anomaly_score" : 10.0
        },
        {
          "job_id" : "job-3",
          "max_anomaly_score" : 80.0
        }
      ],
      "is_interim" : false,
      "result_type" : "overall_bucket"
    }
  ]
}
Notice that overall_score is the average of the two highest scores in this case (the result of overall_score of 55.0 is the average of the job-3 score of 80.0 and the job-1 score of 30.0), even though three anomaly detection jobs match the query pattern of job-*. While this is certainly interesting, perhaps for building a composite alert, you should realize the limitations in this reporting, especially if you can only access the bucket-level anomaly score and not anything from the record or influencer level. In Chapter 6, Alerting on ML Analysis, we will explore some options regarding composite alerting.
Getting the categories API

The categories API call is only relevant for jobs that leverage categorization, as described in detail in Chapter 3, Anomaly Detection. The categories API returns some interesting internal definitions of the categories found during the textual analysis of the documents. If we run the API on the categorization job that we created back in Chapter 3, Anomaly Detection (abbreviated to only return one record for brevity), the output is as follows:

GET _ml/anomaly_detectors/secure_log/results/categories
{
  "page": {
    "size": 1
  }
}
We will see the following response:

{
  "count" : 23,
  "categories" : [
    {
      "job_id" : "secure_log",
      "category_id" : 1,
      "terms" : "localhost sshd Received disconnect from port",
      "regex" : ".*?localhost.+?sshd.+?Received.+?disconnect.+?from.+?port.*",
      "max_matching_length" : 122,
      "examples" : [
        "Oct 22 15:02:19 localhost sshd[8860]: Received disconnect from 58.218.92.41 port 26062:11:  [preauth]",
        "Oct 22 22:27:20 localhost sshd[9563]: Received disconnect from 178.33.169.154 port 53713:11: Bye [preauth]",
        "Oct 22 22:27:22 localhost sshd[9565]: Received disconnect from 178.33.169.154 port 54877:11: Bye [preauth]",
        "Oct 22 22:27:24 localhost sshd[9567]: Received disconnect from 178.33.169.154 port 55723:11: Bye [preauth]"
      ],
      "grok_pattern" : ".*?%{SYSLOGTIMESTAMP:timestamp}.+?localhost.+?sshd.+?%{NUMBER:field}.+?Received.+?disconnect.+?from.+?%{IP:ipaddress}.+?port.+?%{NUMBER:field2}.+?%{NUMBER:field3}.*",
      "num_matches" : 595
    }
  ]
}
Several elements are part of the reply:

• category_id: This is the number of the category of message (incremented from 1). It corresponds to the value of the mlcategory field in the results index.
• terms: This is a list of the static, non-mutable words extracted from the message.
• examples: An array of complete, unaltered sample log lines that fall into this category. These are used to show the users what some of the real log lines look like.
• grok_pattern: A regexp-style pattern match that could be leveraged for Logstash or an ingest pipeline that you could use to match this message category (a sketch of such a pipeline follows this list).
• num_matches: A count of the number of times this message category was seen in the logs throughout the anomaly detection job running on this dataset.

Perhaps the most interesting use of this API is not for anomaly detection, but rather around merely understanding the unique number of category types and the distribution of those types in your unstructured logs – to answer questions such as, "What kinds of messages are in my logs and how many of each type?" Some of this capability may be leveraged in the future to create a "data preparation" pipeline to assist users in ingesting unstructured logs into Elasticsearch more easily. Let's now explore how results gleaned from Elastic ML's anomaly detection and forecasting jobs can be leveraged in custom dashboards, visualizations, and Canvas workpads.
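As a sketch of that grok_pattern idea, the returned pattern can be dropped into a hypothetical ingest pipeline and tested with the simulate API. The pipeline name is made up here, and the pattern and sample log line are simply the ones returned in the response above:

PUT _ingest/pipeline/secure_log_category_1
{
  "description": "Parse sshd disconnect messages using the ML-generated grok_pattern",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          ".*?%{SYSLOGTIMESTAMP:timestamp}.+?localhost.+?sshd.+?%{NUMBER:field}.+?Received.+?disconnect.+?from.+?%{IP:ipaddress}.+?port.+?%{NUMBER:field2}.+?%{NUMBER:field3}.*"
        ]
      }
    }
  ]
}

POST _ingest/pipeline/secure_log_category_1/_simulate
{
  "docs": [
    { "_source": { "message": "Oct 22 22:27:20 localhost sshd[9563]: Received disconnect from 178.33.169.154 port 53713:11: Bye [preauth]" } }
  ]
}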
Custom dashboards and Canvas workpads

It's clear that now that we know the ins and outs of the results index, which stores all the goodness that comes out of Elastic ML's anomaly detection and forecast analytics, our imagination is the limit concerning how we can then express those results in a way that is meaningful for our own goals. This section will briefly explore some of the concepts and ideas that you can use to bring Elastic ML's results to a big screen near you!
Dashboard "embeddables"

One recent addition to the capabilities of Elastic ML is the ability to embed the Anomaly Explorer timeline ("swim lanes") into existing custom dashboards. To accomplish this, simply click the "three dots" menu at the top right of the Anomaly timeline and select the Add to dashboard option:
Figure 5.30 – Adding the Anomaly timeline to another dashboard
At this point, select which part of the swim lane views you want to include and select which dashboard(s) you wish to add them to:
Figure 5.31 – Adding the Anomaly timeline to a specific dashboard
Clicking on the Add and edit dashboard button will then transport the user to the target dashboard and allow them to move and resize the embedded panels. For example, we can have the anomalies side by side with the other visualizations:
Figure 5.32 – New dashboard now containing Anomaly swim lane visualizations
Anomalies as annotations in TSVB

The Time Series Visual Builder (TSVB) component in Kibana is an extremely flexible visualization builder that allows users to not only plot their time series data but also annotate that data with information from other indices. This is a perfect recipe for plotting some raw data, but then overlaying anomalies from an anomaly detection job on top of that raw data. For example, you could create a TSVB for kibana_sample_data_logs with the following panel options:
Figure 5.33 – Creating a new TSVB visualization – Panel options
Then, there is the following configuration for the Data tab to do a terms aggregation for the top five origin countries (geo.src):
Figure 5.34 – Creating a new TSVB visualization – Data options
Then, we have the following configuration for the Annotations tab to overlay the results of a previously created anomaly detection job named web_traffic_per_country to select anomalies with record scores over 90:
Figure 5.35 – Creating a new TSVB visualization – Annotation options
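Based on the job name and score threshold described above, the Query string entry in Figure 5.35 would look something like the following KQL (an assumption of the exact text, not a verbatim copy):

result_type:record and job_id:web_traffic_per_country and record_score > 90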
Note the Query string entry as something pretty recognizable given what we've learned in this chapter. The TSVB also requires a comma-separated list of fields to report on for the annotations (here we enter record_score and partition_field_value) and Row template, which defines how the information is formatted in the annotation (here we define it to be Anomaly:{{record_score}} for {{partition_field_value}}). Once this is done, we have the final result:
Figure 5.36 – Creating a new TSVB visualization complete with anomaly annotations
We now have a nice visualization panel with anomalies superimposed on the original raw data.
Customizing Canvas workpads

Kibana Canvas is the ultimate tool for creating pixel-perfect infographics that are data-driven from Elasticsearch. You can create highly custom-tailored reports with a set of customizable elements. The experience in Canvas is very different from standard Kibana dashboards. Canvas presents you with a workspace where you can build sets of slides (similar in concept to Microsoft PowerPoint) called the workpad. To leverage anomaly detection and/or forecast results in a Canvas workpad, there isn't anything special that needs to be done – everything that has been learned so far in this chapter is applicable. This is because it is very easy to use the essql command in Canvas to query the .ml-anomalies-* index pattern and extract the information we care about. When we install the Kibana sample data, we also get a few sample Canvas workpads to enjoy:
Figure 5.37 – Sample Canvas workpads
Clicking on the [Logs] Web Traffic sample workpad opens it for us to edit:
Figure 5.38 – Sample web traffic workpad
Selecting one of the elements on the page (perhaps the TOTAL VISITORS counter at the bottom, which currently shows the value of 324) and then selecting Expression editor at the bottom-right corner of Canvas will reveal the details of the element:
Figure 5.39 – Editing a Canvas element in the Expression editor
Notice that the real "magic" of obtaining live data is embedded in the essql command – the rest of the expression is merely formatting. As a simple example, we can adjust the SQL with the following syntax:

SELECT COUNT(timestamp) as critical_anomalies FROM \".ml-anomalies-*\" WHERE job_id='web_logs_rate' AND result_type='record' AND record_score>'75'
One thing to note is that because the .ml-anomalies-* index pattern's name begins with a non-alphabet character, the name needs to be enclosed in double-quotes, and those double-quotes need to be escaped with the backslash character. This will return the total number of critical anomalies (those that have a record_score larger than 75) for a particular anomaly detection job on that dataset:
Figure 5.40 – Displaying the count of critical anomalies
In short, it is quite easy to use Canvas to create very beautiful and meaningful data visualizations and leverage information from either anomaly detection results or forecast results.
Summary

Elastic ML's anomaly detection and forecasting analytics create wonderful and meaningful results that are explorable via the rich UI that is provided in Kibana, or programmatically via direct querying of the results indices and the API. Understanding the results of your anomaly detection and forecasting jobs and being able to appropriately leverage that information for further custom visualizations or alerts makes those custom assets even more powerful. In the next chapter, we'll leverage the results to create sophisticated and useful proactive alerts to further increase the operational value of Elastic ML.
6
Alerting on ML Analysis

The previous chapter (Chapter 5, Interpreting Results) explained in depth how anomaly detection and forecasting results are stored in Elasticsearch indices. This gives us the proper background to now create proactive, actionable, and informative alerts on those results. At the time of writing this book, we find ourselves at an inflection point. For several years, Elastic ML has relied on the alerting capabilities of Watcher (a component of Elasticsearch) as this was the exclusive mechanism to alert on data. However, a new platform of alerting has been designed as part of Kibana (and was deemed GA in v7.11) and this new approach will be the primary mechanism of alerting moving forward. There are still some interesting pieces of functionality that Watcher can provide that are not yet available in Kibana alerting. As such, this chapter will showcase the usage of alerts using both Kibana alerting and Watcher. Depending on your needs, you can decide which approach you would like to use. Specifically, this chapter will cover the following topics:

• Understanding alerting concepts
• Building alerts from the ML UI
• Creating alerts with a watch
Technical requirements

The information in this chapter will use the Elastic Stack as it exists in v7.12.
Understanding alerting concepts

Hopefully, without running the risk of being overly pedantic, a few declarations can be made here about alerting and how certain aspects of alerting (especially with respect to anomaly detection) are extremely important to understand before we get into the mechanics of configuring those alerts.
Anomalies are not necessarily alerts

This needs to be explicitly said. Often, users who first embrace anomaly detection feel compelled to alert on everything once they realize that you can alert on anomalies. This is potentially a really challenging situation if anomaly detection is deployed across hundreds, thousands, or even tens of thousands of entities. Anomaly detection, while certainly liberating users from having to define specific, rule-driven exceptions or hardcoded thresholds for alerts, also has the potential to be deployed broadly across a lot of data. We need to be cognizant that detailed alerting on every little anomaly could be potentially quite noisy if we're not careful. Fortunately, there are a few mechanisms that we've already learned about in Chapter 5, Interpreting Results, that help us mitigate such a situation:

• Summarization: We learned that anomalousness is not only reported for individual anomalies (at the "record level") but is also summarized at the bucket level and influencer level. These summary scores can facilitate alerting at a higher level of abstraction if we so desire.
• Normalized scoring: Because every anomaly detection job has a custom normalization scale that is purpose-built for the specific detector configuration and dataset being analyzed, it means that we can leverage the normalized scoring that comes out of Elastic ML to rate-limit the typical alerting cadence. Perhaps for a specific job that you create, alerting at a minimum anomaly score of 10 will typically give about a dozen alerts per day, a score of 50 will give about one per day, and a score of 90 will give about one alert per week. In other words, you can effectively tune the alerting to your own tolerance for the number of alerts you'd prefer to get per unit of time (of course, except for the case of an unexpected system-wide outage, which may create more alerts than usual).
• Correlation/combination: Perhaps alerting on a single metric anomaly (a host's CPU being abnormally high) is not as compelling as a group of related anomalies (CPU is high, free memory is low, and response time is also high). Alerting on compound events or sequences may be more meaningful for some situations. The bottom line is that even though there isn't a one-size-fits-all philosophy about the best way to structure alerting and increase the effectiveness of alerts, there are some options available to the user in order to choose what may be right for you.
In real-time alerting, timing matters

Back in Chapter 2, Enabling and Operationalization, we learned that anomaly detection jobs are a relatively complex orchestration of querying of raw data, analyzing that data, and reporting of the results as an ongoing process that can run in near real time. As such, there were a few key aspects of the job's configuration that determined the cadence of that process, namely the bucket_span, frequency, and query_delay parameters. These parameters define when results are "available" and what timestamp the values will have. This is extremely important because alerting on anomaly detection jobs will involve a subsequent query to the results indices (.ml-anomalies-*), and clearly, when that query is run and what time range it uses determine whether or not you actually find the anomalies you are looking for. To illustrate, let's look at the following:
Figure 6.1 – A representation of the bucket span, query delay, and frequency with respect to now
In Figure 6.1, we see that a particular bucket of time (represented by the width of time equal to t2-t1) lags the current system time (now) by an amount equal to query_delay. Within the bucket, there may be subdivisions of time, as defined by the frequency parameter. With respect to how the results of this bucket are written to the results indices (.ml-anomalies-*), we should remember that the documents written for this bucket will all have a timestamp value equal to the time at t1, the leading edge of the bucket. To make a practical example for discussion, let's imagine the following:

• bucket_span = 15 minutes
• frequency = 15 minutes
• query_delay = 2 minutes

If now is 12:05 P.M., then the bucket corresponding to 11:45 A.M.-12:00 P.M. was queried and processed by Elastic ML sometime around 12:02 P.M. (due to the lag of query_delay) and the results document was written into .ml-anomalies-* soon after (but written with a timestamp equal to 11:45 A.M.). Therefore, if at 12:05 P.M. we looked into .ml-anomalies-* to see whether results were there for 11:45 A.M., we would be pretty confident they would exist, and we could inspect the content. However, if now were only 12:01 P.M., the results documents for the bucket corresponding to 11:45 A.M.-12:00 P.M. would not yet exist and wouldn't be written for another minute or so. We can see that the timing of things is very important. If in our example scenario, we instead had reduced the value of frequency to 7.5 minutes or 5 minutes, then we would indeed have access to the results of the bucket "sooner," but the results would be marked as interim and are subject to change when the bucket is finalized.

Note
Interim results are created within a bucket if the frequency is a sub-multiple of the bucket span, but not all detectors make interim results. For example, if you have a max or high_count detector, then an interim result that shows a higher-than-expected value over the typical value is possible and sensible – you don't need to see the contents of the entire bucket to know that you've already exceeded expectations. However, if you have a mean detector, you really do need to see that entire bucket's worth of observations before determining the average value – therefore, interim results are not produced because they are not sensible.
So, with that said, if we now take the diagram from Figure 6.1 and advance time a little, but also draw the buckets before and after this one, it will look like the following:
Figure 6.2 – A representation of consecutive buckets
Here in Figure 6.2, we see that the current system time (again, denoted by now) is in the middle of bucket t2 – therefore, bucket t2 is not yet finalized and if there are any results written for bucket t2 by the anomaly detection job, they will be marked with the is_interim:true flag as first shown in Chapter 5, Interpreting Results. If we wanted to invoke an alert search that basically asked the question, "Are there any new anomalies created since the last time I looked?" and that search was invoked at the time marked now in Figure 6.2, then we should notice the following (a sketch of such a search appears after this list):

• The "look back" period of time should be about twice the width of bucket_span. This is because this guarantees that we will see any interim results that may be published for the current bucket (here bucket t2) and any finalized results for the previous bucket (here bucket t1). Results from bucket t0 will not be matched because the timestamp for bucket t0 is outside of the window of time queried – this is okay as long as we get the alert query to repeat on a proper schedule (see the following point).
• The time chosen to run this query could fall practically anywhere within bucket t2's window of time and this will still work as described. This is important because the schedule at which the alert query runs will likely be asynchronous to the schedule that the anomaly detection job is operating on (and writing results).
• We would likely schedule our alert search to repeat its operation at most at an interval equal to bucket_span, but it could be executed more frequently if we're interested in catching interim anomalies in the current, not-yet-finalized bucket.
• If we didn't want to consider interim results, we would need to modify the query such that is_interim:false was part of the query logic to not match them.
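Here is a minimal sketch of such a "look back" search against the results indices. The job name is hypothetical, and it assumes a 15-minute bucket span (hence the 30-minute look-back window) and a minimum record score of 50; it also excludes interim results, per the last point above:

GET .ml-anomalies-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "my_job" } },
        { "term": { "result_type": "record" } },
        { "range": { "record_score": { "gte": 50 } } },
        { "range": { "timestamp": { "gte": "now-30m" } } }
      ],
      "must_not": [
        { "term": { "is_interim": true } }
      ]
    }
  }
}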
Given all of these conditions, you might think that there is some type of dark magic that is required to get this working correctly and reliably. Fortunately, when we build the alerts using Kibana from the Elastic ML UI, these considerations are taken care of for you. However, if you feel like you are a wizard and fully understand how this all works, then you may not be too intimidated by the prospect of building very custom alert conditions using Watcher, where you will have complete control. In the following main sections, we'll do some examples using each method so that you can compare and contrast how they work.
Building alerts from the ML UI

With the release of v7.12, Elastic ML changed its default alert handler from Watcher to Kibana alerting. Prior to v7.12, the user had a choice of accepting a default watch (an instance of a script for Watcher) if alerting was selected from the ML UI, or the user could create a watch from scratch. This section will focus on the new workflow using Kibana alerting as of v7.12, which offers a nice balance of flexibility and ease of use. To create a working, illustrative example of real-time alerting, we will contrive a scenario using the Kibana sample web logs dataset that we first used in Chapter 3, Anomaly Detection. The process outlined in this section will be as follows:

1. Define some sample anomaly detection jobs on the sample data.
2. Define two alerts on two of the anomaly detection jobs.
3. Run a simulation of anomalous behavior, to catch that behavior in an alert.

Let's first define the sample anomaly detection jobs.
Defining sample anomaly detection jobs

Of course, before we can build alerts, we need jobs running in real time. We can leverage the sample ML jobs that come with the same Kibana web logs dataset.

Note
If you still have this dataset loaded in your cluster, you should delete it and re-add it. This will reset the timestamps on the dataset so that about half of the data will be in the past and the rest will be in the future. Having some data in the future will allow us to pretend that data is appearing in real time and therefore our real-time anomaly detection jobs and our alerts on those jobs will act like they are truly real time.
To get started, let's reload the sample data and build some sample jobs: 1. From the Kibana home screen, click on Try our sample data:
Figure 6.3 – Kibana home screen
2. Click on Index Patterns in the Sample web logs section (if already loaded, please remove and re-add):
Figure 6.4 – Add sample web logs data
3. Under the View data menu, select ML jobs to create some sample jobs:
Figure 6.5 – Selecting to create some sample ML jobs
4. Give the three sample jobs a job ID prefix (here alert-demo- was chosen) and make sure you de-select Use full kibana_sample_data_logs data and pick the end time to be the closest 15 minutes to your current system time (in your time zone):
Figure 6.6 – Naming the sample jobs with a prefix and selecting now as the end time
5. Notice in Figure 6.6 that Apr 8, 2021 @ 11:00:00.00 was chosen as the end time and that a date of 11 days earlier (Mar 28, 2021 @ 00:00:00.00) was chosen as the start time (the sample data goes back about 11 days from when you install it). The current local time at the time of this screenshot was 11:10 A.M. on April 8th. This is important in the spirit of trying to make this sample data seem real time. Click the Create Jobs button to set the job creation in motion. Once the jobs are created, you will see the following screen:
Figure 6.7 – Sample jobs completed initial run
6. We don't need to view the results just yet. Instead, we need to make sure these three jobs are running in real time. Let's click Anomaly Detection at the top to return us to the Job Management page. There we can see our three jobs have analyzed some data but are now in the closed state with the data feeds currently stopped:
Figure 6.8 – Sample jobs in the Jobs Management screen
7. Now we need to enable these three jobs to run in real time. Click the boxes next to each job, and then select the gear icon to bring up the menu to choose Start data feeds for all three jobs:
Figure 6.9 – Starting the datafeed for all three sample jobs
8. In the pop-up window, choose the top option for both Search start time and Search end time, ensuring that the job will continue to run in real time. For now, we will leave Create alert after datafeed has started unchecked as we will create our own alerts in just a moment:
Figure 6.10 – Starting the data feeds of the three sample jobs to run in real time
9. After clicking the Start button, we will see that our three jobs are now in the opened/started state:
Figure 6.11 – Sample jobs now running in real time
Now that we have our jobs up and running, let's now define a few alerts against them.
Creating alerts against the sample jobs

With our jobs running in real time, we can now define some alerts for our jobs:

1. For the alert-demo-response_code_rates job, click the … icon and select Create alert:
Figure 6.12 – Creating an alert for a sample job
2. Now the Create alert flyout window appears, and we can now begin to fill in our desired alert configuration:
Figure 6.13 – Creating an alert configuration
3. In Figure 6.13, we will name our alert, but will also define that we wish to have this alert check for anomalies every 10 minutes. This job's bucket_span is set for 1 hour, but the frequency is set to 10 minutes – therefore interim results will be available much sooner than the full bucket time. This is also why we chose to include interim results in our alert configuration, so that we can get notified as soon as possible. We also set Result type to be of type Bucket to give us a summarized treatment of the anomalousness, as previously discussed. Finally, we set the severity threshold to 51 to have alerts be generated only for anomalies of a score exceeding that value.
4. Before we continue too far, we can check the alert configuration on past data. Putting 30d into the test box, we can see that there was only one other alert in the last 30 days' worth of data that matched this alert condition:
Figure 6.14 – Testing alert configuration on past data
5. Lastly, we can configure an action to invoke on an alert being fired. In this case, our system was pre-configured to use Slack as an alert action, so we will choose that here, but there are many other options available for the user to consider (please see https://www.elastic.co/guide/en/kibana/current/action-types.html to explore all options available and how to customize the alert messaging):
Figure 6.15 – Configuring alert action
6. Clicking on the Save button will obviously save the alert, which is then viewable and modifiable via the Stack Management | Alerts and Actions area of Kibana:
Figure 6.16 – Alerts management
7. We are going to create one more alert, for the alert-demo-url_scanning job. This time, we'll create a Record alert, but with the other configuration parameters similar to the prior example:
Figure 6.17 – Configuring another alert on the URL scanning job
Now that we have our two alerts configured, let's move on to simulating an actually anomalous situation in real time to trigger our alerts.
Simulating some real-time anomalous behavior

Triggering simulated anomalous behavior in the context of these sample web logs is a little tricky, but not too hard. It will involve some usage of the Elasticsearch APIs, executing a few commands via the Dev Tools console in Kibana. Console is where you can issue API calls to Elasticsearch and see the output (response) of those API calls.

Note
If you are unfamiliar with Console, please consult https://www.elastic.co/guide/en/kibana/current/console-kibana.html.
What we will be simulating is twofold – we'll inject several fake documents into the index that the anomaly detection job is monitoring, and then wait for the alert to fire. These documents will show a spike in requests from a fictitious IP address of 0.0.0.0 that will result in a response code of 404 and will also be requesting random URL paths. Let's get started:
1. We need to determine the current time in UTC. We must know the UTC time (as opposed to your local time zone's time) because the documents stored in the Elasticsearch index are stored in UTC. To determine this, you can simply use an online tool (such as Googling current time utc). At the time of writing, the current UTC time is 4:41 P.M. on April 8, 2021. Converted into the format that Elasticsearch expects for the kibana_sample_data_logs index, it takes the following form:
"timestamp": "2021-04-08T16:41:00.000Z"
2. Let's now insert some new bogus documents into the kibana_sample_data_logs index at the current time (perhaps with a little buffer – rounding up to the next half hour, in this case to 17:00). Replace the timestamp field value accordingly and invoke the following command at least 20 times in the Dev Tools console:
POST kibana_sample_data_logs/_doc
{
  "timestamp": "2021-04-08T17:00:00.000Z",
  "event.dataset": "sample_web_logs",
  "clientip": "0.0.0.0",
  "response": "404",
  "url": ""
}
3. We can then dynamically modify only the documents we just inserted (in particular, the url field) to simulate that the URLs are all unique by using a little script to randomize the field value in an _update_by_query API call:
POST kibana_sample_data_logs/_update_by_query
{
  "query": {
    "term": {
      "clientip": {
        "value": "0.0.0.0"
      }
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.url = '/path/to/' + UUID.randomUUID().toString();"
  }
}
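Before moving over to Discover, you can also spot-check from Console that the update touched the documents we inserted; a simple search on the bogus IP address should now show randomized url values (this is purely a convenience check and can be skipped):
GET kibana_sample_data_logs/_search
{
  "query": {
    "term": {
      "clientip": {
        "value": "0.0.0.0"
      }
    }
  },
  "_source": [ "timestamp", "clientip", "response", "url" ],
  "size": 5
}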
4. We can validate that we have correctly created a bunch of unique, random requests from our bogus IP address by looking at the appropriate time in Kibana Discover:
Figure 6.18 – Our contrived burst of anomalous events shown in Discover
5. Notice in Figure 6.18 that we had to peek into the future a little to see the documents we artificially inserted (as the red vertical line in the timeline near 12:45 P.M. is the actual current system time in the local time zone). Also notice that our inserted documents have a nice-looking random url field as well. Now that we have "laid the trap" for anomaly detection to find, and we have our alerts ready, we must now sit back and patiently wait for the alerts to trigger.
Receiving and reviewing the alerts
Since the anomalous behavior we inserted is now waiting to be found by our anomaly detection job and our alerts, we can contemplate when we should expect to see the alerts. Given that our jobs have bucket spans of 1 hour, frequencies of 10 minutes, and query delays on the order of 1-2 minutes (and that our alerts look for interim results and run on a 10-minute schedule that is asynchronous from the anomaly detection job), we should expect to see our alerts between 1:12 P.M. and 1:20 P.M. local time. Right on cue, the alert messages for the two jobs surface in Slack at 1:16 P.M. and 1:18 P.M. local time:
Figure 6.19 – Alerts received in the Slack client
The top alert in Figure 6.19, of course, is for the anomaly detection job that was counting the number of events for each response.keyword (and thus seeing that the spike of 404 documents exceeds expectations), and the bottom alert is for the other job, which notices the high distinct count of unique URLs being requested. Notice that both jobs correctly identify clientip = 0.0.0.0 as an influencer on the anomalies. The alert text includes a link to view the information directly in Anomaly Explorer. In Figure 6.20, we can see that by following the link in the second alert, we arrive at a familiar place to investigate the anomaly further:
Figure 6.20 – Anomaly Explorer from the alert drill-down link
Hopefully, through this example you can see not only how to use the Kibana alerting framework with anomaly detection jobs, but you can also now appreciate the intricacies of the real-time operation of both the job and the alert. The settings within the job's datafeed and the alert's sampling interval truly affect how real-time the alerts can be. We could, for example, have reduced both the datafeed frequency and the alert's Check every setting to shave a few minutes off. In the next section, we won't attempt to replicate real-time alert detection with Watcher, but we will work to understand the equivalent settings within a watch, both to see what is needed to interface Watcher with an anomaly detection job and to showcase some interesting example watches.
Creating an alert with a watch
Prior to version 7.12, Watcher was used as the mechanism to alert on anomalies found by Elastic ML. Watcher is a very flexible native plugin for Elasticsearch that can handle a number of automation tasks, and alerting is certainly one of them. In versions 7.11 and earlier, users could either create their own watch (an instance of an automation task in Watcher) from scratch to alert on anomaly detection job results or opt to use a default watch template that was created for them by the Elastic ML UI. We will first look at the default watch that was provided and then discuss some ideas around custom watches.
Understanding the anatomy of the legacy default ML watch
Now that alerting on anomaly detection jobs is handled by the new Kibana alerting framework, the legacy default watch template (plus a few other examples) is memorialized in a GitHub repository here: https://github.com/elastic/examples/tree/master/Alerting/Sample%20Watches/ml_examples. In dissecting the default ML watch (default_ml_watch.json) and the companion version, which has an email action (default_ml_watch_email.json), we see that there are four main sections:
• trigger: Defines the scheduling of the watch
• input: Specifies the input data to be evaluated
• condition: Assesses whether or not the actions section is executed
• actions: Lists the desired actions to take if the watch's condition is met
Note
For a full explanation of all of the options of Watcher, please consult the Elastic documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/how-watcher-works.html.
Let's discuss each section in depth.
The trigger section
In the default ML watch, the trigger section is defined as follows:
"trigger": {
  "schedule": {
    "interval": "82s"
  }
},
Here, we can see that the interval at which the watch will fire in real time is every 82 seconds. This usually should be a random value between 60 and 120 seconds so that if a node restarts, all of the watches will not be synchronized, and they will have their execution times more evenly spread out to reduce any potential load on the cluster. It is also important that this interval value is less than or equal to the bucket span of the job. As explained earlier in this chapter, having it larger than the bucket span may cause recently written anomaly records to be missed by the watch. With the interval being less (or even much less) than the bucket span of the job, you can also take advantage of the advanced notification that is available when there are interim results, anomalies that can still be determined despite not having seen all of the data within a bucket span.
The input section
The input section starts with a search section in which the following query is defined against the .ml-anomalies-* index pattern:
"query": {
  "bool": {
    "filter": [
      {
        "term": {
          "job_id": ""
        }
      },
      {
        "range": {
          "timestamp": {
            "gte": "now-30m"
          }
        }
      },
      {
        "terms": {
          "result_type": [
            "bucket",
            "record",
            "influencer"
          ]
        }
      }
    ]
  }
},
Here, we are asking Watcher to query for bucket, record, and influencer result documents for a job over the last 30 minutes (you would replace the empty job_id value with the actual job_id of the anomaly detection job of interest). As we know from earlier in the chapter, this look-back window should be twice the bucket_span value of the ML job (so this template must assume that the job's bucket span is 15 minutes). While all result types are asked for, we will see later that only the bucket-level results are used to evaluate whether or not to create an alert. Next comes a series of three aggregations. When they're collapsed, they look as follows:
Figure 6.21 – Query aggregations in the watch input
The bucket_results aggregation first filters for buckets where the anomaly score is greater than or equal to 75:
"aggs": {
  "bucket_results": {
    "filter": {
      "range": {
        "anomaly_score": {
          "gte": 75
        }
      }
    },
Then, a sub-aggregation asks for the top 1 bucket sorted by anomaly_score:
"aggs": {
  "top_bucket_hits": {
    "top_hits": {
      "sort": [
        {
          "anomaly_score": {
            "order": "desc"
          }
        }
      ],
      "_source": {
        "includes": [
          "job_id",
          "result_type",
          "timestamp",
          "anomaly_score",
          "is_interim"
        ]
      },
      "size": 1,
Next, still within the top_bucket_hits sub-aggregation, there are a series of defined script_fields:
"script_fields": {
  "start": {
    "script": {
      "lang": "painless",
      "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].value.getMillis()-((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
      "params": {
        "padding": 10
      }
    }
  },
  "end": {
    "script": {
      "lang": "painless",
      "source": "LocalDateTime.ofEpochSecond((doc[\"timestamp\"].value.getMillis()+((doc[\"bucket_span\"].value * 1000)\n * params.padding)) / 1000, 0, ZoneOffset.UTC).toString()+\":00.000Z\"",
      "params": {
        "padding": 10
      }
    }
  },
  "timestamp_epoch": {
    "script": {
      "lang": "painless",
      "source": """doc["timestamp"].value.getMillis()/1000"""
    }
  },
  "timestamp_iso8601": {
    "script": {
      "lang": "painless",
      "source": """doc["timestamp"].value"""
    }
  },
  "score": {
    "script": {
      "lang": "painless",
      "source": """Math.round(doc["anomaly_score"].value)"""
    }
  }
}
These newly defined variables will be used by the watch to provide more functionality and context. Some of the variables are merely reformatting values (score is just a rounded version of anomaly_score), while start and end will later fill a functional role by defining a start and end time that is equal to +/- 10 bucket spans from the time of the anomalous bucket. This is later used by the UI to show an appropriate contextual time range before and after the anomalous bucket so that the user can see things more clearly. The influencer_results and record_results aggregations ask for the top three influencer scores and record scores, but only the output of the record_results aggregation is used in subsequent parts of the watch (and only in the action section of default_ml_watch_email.json, which contains some default email text).
The condition section
The condition section is where the input is evaluated to see whether or not the action section is executed. In this case, the condition section is as follows:
"condition": {
  "compare": {
    "ctx.payload.aggregations.bucket_results.doc_count": {
      "gt": 0
    }
  }
},
We are using this to check whether the bucket_results aggregation returned any documents (that is, whether doc_count is greater than 0). In other words, if the bucket_results aggregation returned non-zero results, then there were documents where anomaly_score was greater than or equal to 75. If true, then the action section will be invoked.
The action section
The action section of our default watches has two parts in our case: a log action for logging information to a file and a send_email action for sending an email. The text of the watch won't be repeated here for brevity (it is a lot of text). The log action will print a message to an output file, which by default is the Elasticsearch log file. Notice that the syntax of the message uses the templating language called Mustache (so named because of its prolific use of curly braces). Simply put, variables contained in Mustache's double curly braces are substituted with their actual values. As a result, for one of the sample jobs we created earlier in the chapter, the logging text written out to the file may look as follows:
Alert for job [alert-demo-response_code_rates] at [2021-04-08T17:00:00.000Z] score [91]
This alert should look similar to what we saw in our Slack message earlier in the chapter – of course, because it is derived from the same information. The email version of the action may look as follows:
Elastic Stack Machine Learning Alert
Job: alert-demo-response_code_rates
Time: 2021-04-08T17:00:00.000Z
Anomaly score: 91
Click here to open in Anomaly Explorer.
Top records:
count() [91]
It is clear that the format of the alert HTML is oriented not around giving the user a summary of the information, but around enticing the user to investigate further by clicking on the link within the email. Also, it is notable that up to the top three records are reported in the text of the email response. In our example case, there is only one record (a count detector with a score of 91). This section of information came from the record_results aggregation we described previously in the input section of the watch.
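To make the Mustache substitution just described more concrete, the general shape of a Watcher logging action of this kind looks roughly like the following. This is only a sketch: the Mustache paths are illustrative of the usual pattern of referencing fields from the watch payload rather than copied verbatim from the shipped template, whose message text is longer:
"actions": {
  "log": {
    "logging": {
      "text": "Alert for job [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0._source.job_id}}] at [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0.fields.timestamp_iso8601.0}}] score [{{ctx.payload.aggregations.bucket_results.top_bucket_hits.hits.hits.0.fields.score.0}}]"
    }
  }
}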
This default watch is a good, usable alert that provides summarized information about the unusualness of the dataset over time, but it is also good to understand the implications of using it:
• The main condition for firing the alert is a bucket anomaly score above a certain value. Therefore, it would not alert on individual anomalous records within a bucket in the case where their score does not lift the overall bucket score above the stated threshold.
• By default, only a maximum of the top three record scores in the bucket are reported in the output, and only in the email version.
• The only actions in these examples are logging and email. Adding other actions (Slack message, webhook, and so on) would require manually editing the watch.
Knowing this, it may become necessary at some point to create a more full-featured, complex watch to fully customize the behavior and output of the watch. In the next section, we'll discuss some more examples of creating a watch from scratch.
Custom watches can offer some unique functionality
For those who feel emboldened and want to dig deeper into some advanced aspects of Watcher, let's look at some highlights from a few of the other samples in the GitHub repository. These include examples of querying the results of multiple jobs at once, programmatically combining the anomaly scores, and dynamically gathering additional potential root-cause evidence of other anomalies correlated in time.
Chained inputs and scripted conditions
A nice example of an interesting watch is multiple_jobs_watch.json, which shows the ability to do a chained input (doing multiple queries against the results of multiple jobs) but also executing a more dynamic condition using a script:
"condition" : {
  "script" : {
    // return true only if the combined weighted scores are greater than 75
    "source" : "return ((ctx.payload.job1.aggregations.max_anomaly_score.value * 0.5) + (ctx.payload.job2.aggregations.max_anomaly_score.value * 0.2) + (ctx.payload.job3.aggregations.max_anomaly_score.value * 0.1)) > 75"
  }
},
This is basically saying that the alert only gets triggered if the combined weighted anomaly scores of the three different jobs are greater than a value of 75. In other words, not every job is considered equally important, and the weighting takes that into account.
Passing information between chained inputs
Another unique aspect of chained inputs is that information gleaned from one input chain can be passed along to another. As shown in chained_watch.json, the second and third input chains use the timestamp value learned from the first query as part of the range filter:
{
  "range": {
    "timestamp": {
      "gte": "{{ctx.payload.job1.hits.hits.0._source.timestamp}}||-{{ctx.metadata.lookback_window}}",
      "lte": "{{ctx.payload.job1.hits.hits.0._source.timestamp}}"
    }
  }
},
This effectively means that the watch is gathering anomalies as evidence culled from a window of time prior to a presumably important anomaly from the first job. This kind of alert aligns nicely with the situation we'll discuss in Chapter 7, AIOps and Root Cause Analysis, in which a real application problem is solved by looking for correlated anomalies in a window of time around the anomaly of a KPI. Therefore, a sample output of this watch could look like this:
[CRITICAL] Anomaly Alert for job it_ops_kpi: score=85.4309 at 2021-02-08 15:15:00 UTC
Possibly influenced by these other anomalous metrics (within the prior 10 minutes):
job:it_ops_network: (anomalies with at least a record score of 10):
field=In_Octets: score=11.217614808972602, value=13610.62255859375 (typical=855553.8944717721) at 2021-02-08 15:15:00 UTC
field=Out_Octets: score=17.00518, value=1.9079535783333334E8 (typical=1116062.402864764) at 2021-02-08 15:15:00 UTC
field=Out_Discards: score=72.99199, value=137.04444376627606 (typical=0.012289061361553099) at 2021-02-08 15:15:00 UTC
job:it_ops_sql: (anomalies with at least a record score of 5):
hostname=dbserver.acme.com field=SQLServer_Buffer_Manager_Page_life_expectancy: score=6.023424, value=846.0000000000005 (typical=12.609336298838242) at 2021-02-08 15:10:00 UTC
hostname=dbserver.acme.com field=SQLServer_Buffer_Manager_Buffer_cache_hit_ratio: score=8.337633, value=96.93249340057375 (typical=98.93088463835487) at 2021-02-08 15:10:00 UTC
hostname=dbserver.acme.com field=SQLServer_General_Statistics_User_Connections: score=27.97728, value=168.15000000000006 (typical=196.1486370757187) at 2021-02-08 15:10:00 UTC
Here, the formatting of the output that collates results from each of the three payloads is managed with a hefty transform script that leverages the Java-like Painless scripting language.
Note
For more information on the Painless scripting language, please consult the Elastic documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-painless.html.
If you're not intimidated by the code-heavy format of Watcher, you can wield it as a very powerful tool to implement some very interesting and useful alert schemes.
Summary
Anomaly detection jobs are certainly useful on their own, but when combined with near real-time alerting, users can really harness the power of automated analysis – while also being confident about getting only alerts that are meaningful. After a practical study of how to effectively capture the results of anomaly detection jobs with real-time alerts, we went through a comprehensive example of using the new Kibana alerting framework to easily define some intuitive alerts, and we tested them with a realistic alerting scenario. We then witnessed how an expert user can leverage the full power of Watcher for advanced alerting techniques if Kibana alerting cannot satisfy complex alerting requirements. In the next chapter, we'll see how anomaly detection jobs can assist not only with alerting on important key performance indicators, but also how Elastic ML's automated analysis of a broad set of data within a specific application context is the means to achieving some "AI" for tracking down an application problem and determining its root cause.
7
AIOps and Root Cause Analysis
Up until this point, we have extensively explained the value of detecting anomalies across metrics and logs separately. This is extremely valuable, of course. In some cases, however, the knowledge that a particular metric or log file has gone awry may not tell the whole story of what is going on. It may, for example, be pointing to a symptom and not the cause of the problem. To have a better understanding of the full scope of an emerging problem, it is often helpful to look holistically at many aspects of a system or situation. This involves smartly analyzing multiple kinds of related datasets together. In this chapter, we will cover the following topics:
• Demystifying the term "AIOps"
• Understanding the importance and limitations of KPIs
• Moving beyond KPIs
• Organizing data for better analysis
• Leveraging the contextual information
• Bringing it all together for RCA
Technical requirements
The information and examples demonstrated in this chapter are relevant as of v7.11 of the Elastic Stack and utilize sample datasets from the GitHub repo found at https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition.
Demystifying the term "AIOps"
We learned in Chapter 1, Machine Learning for IT, that many companies are drowning in an ever-increasing cascade of IT data while simultaneously being asked to "do more with less" (fewer people, fewer costs, and so on). Some of that data is collected and/or stored in specialized tools, but some may be collected in general-purpose data platforms such as the Elastic Stack. But the question still remains: what percentage of that data is being paid attention to? By this, we mean the percentage of collected data that is actively inspected by humans or being watched by some type of automated means (defined alarms based on rules, thresholds, and so on). Even generous estimates might put the percentage in the range of single digits. So, with 90% or more of the collected data going unwatched, what's being missed? The proper answer might be that we don't actually know. Before we admonish IT organizations for the sin of collecting piles of data but not watching it, we need to understand the magnitude of the challenge associated with such an operation. A typical user-facing application may do the following:
• Span hundreds of physical servers
• Have dozens (if not hundreds) of microservices, each of which may have dozens or hundreds of operational metrics or log entries that describe its operation
The combinatorics of this can easily rise to a six- or seven-figure range of unique measurement points. Additionally, there may be dozens or even hundreds of such applications under the umbrella of management by the IT organization. It's no wonder that the amount of data being collected by these systems per day can easily be measured in terabytes.
So, it is quite natural that the desired solution could involve a combination of automation and artificial intelligence to lessen the burden on human analysts. Some clever marketing person somewhere figured out that coining the term "AIOps" encapsulated a projected solution to the problem – augment what humans can't do (or don't have the time or capacity to do manually) with some amount of intelligent automation. Now, what an AIOps solution actually does to accomplish that goal is often left to a discerning user to interpret. So, let's demystify the term by not focusing on the term itself (let's leave that to the marketing folks), but rather articulating the kinds of things we would want this intelligent technology to do to help us in our situation:
• Autonomously inspect data and assess its relevance, importance, and notability based upon an automatically learned set of constraints, rules, and behavior.
• Filter out the noise of irrelevant behaviors so as to not distract human analysts from the things that actually matter.
• Obtain a certain amount of proactive early warning regarding problems that may be brewing but have not necessarily caused an outage yet.
• Automatically gather related/correlated evidence around a problem to assist with Root Cause Analysis (RCA).
• Uncover operational inefficiencies in order to maximize infrastructure performance.
• Suggest an action or next step for remediation, based upon past remediations and their effectiveness.
While this list is in no way comprehensive, we can see the gist of what we're getting at here – which is that intelligent automation and analysis can pay big dividends and allow IT departments to drive efficiencies and thus maximize business outcomes. Except for the suggested remediations mentioned in the last item of the preceding list (at least at this moment), Elastic Machine Learning (ML) can very much be an important part of all the other goals on this list. We've seen already how Elastic ML can automatically find anomalous behavior, forecast trends, proactively alert, and so on. But we must also recognize that Elastic ML is a generic ML platform – it is not purpose-built for IT operations/observability or security analytics. As such, there still needs to be an orientation of how Elastic ML is used in the context of operations, and that will be discussed throughout this chapter.
It is also important to note that there are still a large number of IT operation groups that currently use no intelligent automation and analysis. They often claim that they would like to employ an AI-based approach to improve their current situation, but that they are not quite ready to take the plunge. So, let's challenge the notion that the only way to benefit from AI is to do every single thing that is possible on day 1. Let's instead build up some practical applications of Elastic ML in the context of IT operations and how it can be used to satisfy most of the goals articulated in the preceding list. We will first start with the notion of the Key Performance Indicator (KPI) and why it is the logical choice for the best place to get started with Elastic ML.
Understanding the importance and limitations of KPIs
Because of the problem of scale and the desire to make some amount of progress in making the collected data actionable, it is natural that some of the first metrics to be tackled for active inspection are those that are the best indicators of performance or operation. The KPIs that an IT organization chooses for measurement, tracking, and flagging can span diverse indicators, including the following:
• Customer experience: These metrics measure customer experience, such as application response times or error rates.
• Availability: Metrics such as uptime or Mean Time to Repair (MTTR) are often important to track.
• Business: Here we may have metrics that directly measure business performance, such as orders per minute or number of active users.
As such, these types of metrics are usually displayed, front and center, on most high-level operational dashboards or on staff reports for employees ranging from technicians to executives. A quick Google image search for a KPI dashboard will return countless examples of charts, gauges, dials, maps, and other eye candy.
While there is great value in such displays of information that can be consumed with a mere glance, there are still fundamental challenges with manual inspection:
• Interpretation: There may be difficulty in understanding the difference between normal operation and abnormal, unless that difference is already intrinsically understood by the human.
• Challenges of scale: Despite the fact that KPIs are already a distillation of all metrics down to a set of important ones, there still may be more KPIs to display than is feasible given the real estate of the screen that the dashboard is displayed upon. The end result may be crowded visualizations or lengthy dashboards that require scrolling/paging.
• Lack of proactivity: Many dashboards such as this do not have their metrics also tied to alerts, thus requiring constant supervision even when a faltering KPI is known to be important.
The bottom line is that KPIs are an extremely important step in the process of identifying and tracking meaningful indicators of the health and behavior of an IT system. However, the mere act of identifying and tracking a set of KPIs with a visual-only paradigm is going to leave some significant deficiencies in the strategy of a successful IT operations plan. It should also be obvious that KPIs are a great candidate for metrics that can be tracked with Elastic ML's anomaly detection. For example, say we have some data that looks like the following (from the it_ops_kpi sample dataset in the GitHub repo):
{
  "_index" : "it_ops_kpi",
  "_type" : "_doc",
  "_id" : "UqUsMngBFOh8A28xK-E3",
  "_score" : 1.0,
  "_source" : {
    "@timestamp" : "2021-01-29T05:36:09.000Z",
    "events_per_min" : 28,
    "kpi_indicator" : "online_purchases"
  }
},
In this case, the KPI (the field called events_per_min) represents the summarized total number of purchases per minute for some online transaction processing system. We could easily track this KPI over time with an anomaly detection job with a sum function on the events_per_min field and a bucket span of 15 minutes. An unexpected dip in online sales (to a value of 921) is detected and flagged as anomalous:
Figure 7.1 – A KPI being analyzed with a typical anomaly detection job
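Although this job is easy to build in the UI, the same configuration can also be expressed through the anomaly detection APIs. The following is a minimal sketch: the job name, bucket span, and time field are taken from the description above, a corresponding datafeed pointing at the it_ops_kpi index would still need to be created and started, and the actual configurations used for this book are supplied in the GitHub repo:
PUT _ml/anomaly_detectors/it_ops_kpi
{
  "description": "Sum of online purchases per bucket",
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "sum(events_per_min)",
        "function": "sum",
        "field_name": "events_per_min"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}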
In this case, the KPI is just a single, overall metric. If there was another categorical field in the data that allowed it to be segmented (for example, sales by product ID, product category, geographical region, and so on), then ML could easily split the analysis along that field to expand the analysis in a parallel fashion (as we saw in Chapter 3, Anomaly Detection). But let's not lose sight of what we're accomplishing here: a proactive analysis of a key metric that someone likely cares about. The number of online sales per unit of time is directly tied to incoming revenue and thus is an obvious KPI. However, despite the importance of knowing that something unusual is happening with our KPI, there is still no insight as to why it is happening. Is there an operational problem with one of the backend systems that supports this customer-facing application? Was there a user interface coding error in the latest release that makes it harder for users to complete the transaction? Is there a problem with the third-party payment processing provider that is relied upon? None of these questions can be answered by merely scrutinizing the KPI. To get that kind of insight, we will need to broaden our analysis to include other sets of relevant and related information.
Moving beyond KPIs
The process of selecting KPIs, in general, should be relatively easy, as it is likely obvious what metrics are the best indicators (if online sales are down, then the application is likely not working). But if we want to get a more holistic view of what may be contributing to an operational problem, we must expand our analysis beyond the KPIs to indicators that emanate from the underlying systems and technology that support the application.
Fortunately, there are a plethora of ways to collect all kinds of data for centralization in the Elastic Stack. The Elastic Agent, for example, is a single, unified agent that you can deploy to hosts or containers to collect data and send it to the Elastic Stack. Behind the scenes, the Elastic Agent runs the Beats shippers or Elastic Endpoint required for your configuration. Starting from version 7.11, the Elastic Agent is managed in Kibana in the Fleet user interface and can be used to add and manage integrations for popular services and platforms:
Figure 7.2 – The Integrations section of the Fleet user interface in Kibana
Using these different integrations, the user can easily collect data and centralize it in the Elastic Stack. While this chapter is not meant to be a tutorial on Fleet and the Elastic Agent, the important point is that regardless of what tools you use to gather the underlying application and system data, one thing is likely true: there will be a lot of data when all is said and done. Remember that our ultimate goal is to proactively and holistically pay attention to a larger percentage of the overall dataset. To do that, we must first organize this data so that we can effectively analyze it with Elastic ML.
Organizing data for better analysis
One of the nicest things about ingesting data via the Elastic Agent is that by default, the data collected is normalized using the Elastic Common Schema (ECS). ECS is an open source specification that defines a common taxonomy and naming conventions across data that is stored in the Elastic Stack. As such, the data becomes easier to manage, analyze, visualize, and correlate across disparate data types – including across both performance metrics and log files. Even if you are not using the Elastic Agent or other legacy Elastic ingest tools (such as Beats and Logstash) and are instead relying on other, third-party data collection or ingest pipelines, it is still recommended that you conform your data to ECS because it will pay big dividends when users expect to use this data for queries, dashboards, and, of course, ML jobs.
Note
More information on ECS can be found in the reference section of the website at https://www.elastic.co/guide/en/ecs/current/ecs-reference.html.
Among many of the important fields within ECS is the host.name field, which defines which host the data was collected from. By default, most data collection strategies in the Elastic Stack involve putting data in indices that are oriented around the data type, and thus potentially contain interleaved documents from many different hosts. Perhaps some of the hosts in our environment support one application (that is, online purchases), but other hosts support a different application (such as invoice processing). With all hosts reporting their data into a single index, if we are interested in orienting our reporting and analysis of the data for one or both applications, it is obviously inappropriate to orient the analysis based solely on the index – we will need our analysis to be application-centric. In order to accomplish this, we have a few options:
• Modifying the base query of the anomaly detection job so that it filters the data for only the hosts associated with the application of interest
• Modifying the data on ingest to enrich it, inserting additional contextual information into each document, which will later be used to filter the query made by the anomaly detection job
Both require customization of the datafeed query that the anomaly detection job makes to the raw data in the source indices. The first option may result in a relatively complex query, and the second option requires an interstitial step of data enrichment using custom ingest pipelines. Let's briefly discuss each.
Custom queries for anomaly detection datafeeds
When a new job is created in the anomaly detection UI, the first step is to choose either an index pattern or a Kibana saved search. If the former is chosen, then a {"match_all":{}} Elasticsearch query (return every record in the index) is invoked. If the job is created via the API or the advanced job wizard, then the user can specify just about any valid Elasticsearch DSL for filtering the data. Composing free-form Elasticsearch DSL can be a little error-prone for non-expert users. Therefore, a more intuitive way would be to approach this from Kibana via saved searches.
For example, let's say that we have an index of log files, and the hosts associated with the application we would like to monitor and analyze consist of two servers, esxserver1.acme.com and esxserver2.acme.com. On Kibana's Discover page, we can build a filtered query with KQL using the search box at the top of the user interface:
Figure 7.3 – Building a filtered query using KQL
The text of this KQL query would be as follows:
physicalhost_name:"esxserver1.acme.com" or physicalhost_name:"esxserver2.acme.com"
If you were curious about the actual Elasticsearch DSL that is invoked by Kibana to get this filtered query, you could click the Inspect button in the top right and select the Request tab to see the Elasticsearch DSL:
Figure 7.4 – Inspecting the Elasticsearch DSL that runs for the KQL filter
It is probably worth noting that despite the way the KQL query gets translated to Elasticsearch DSL in this specific example (using match_phrase, for example), it is not the only way to achieve the desired results. A query filter using terms is yet another way (a sketch of such a filter follows Figure 7.5), but assessing the merits of one over the other is beyond the scope of this book. Regardless of the Elasticsearch DSL that runs behind the scenes, the key thing is that we have a query that filters the raw data to identify only the servers of interest for the application we would like to analyze with Elastic ML. To keep this filtered search, click the Save button in the top right and give the search a name:
Figure 7.5 – Saving the search for later use in Elastic ML
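For reference, the terms-based alternative mentioned above might look roughly like the following if written by hand; note that if physicalhost_name is mapped as a text field, the filter would need to target its keyword sub-field (for example, physicalhost_name.keyword), so treat this as a sketch rather than the exact query Kibana generates:
"query": {
  "bool": {
    "filter": [
      {
        "terms": {
          "physicalhost_name": [
            "esxserver1.acme.com",
            "esxserver2.acme.com"
          ]
        }
      }
    ]
  }
}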
Later on, you could then select this saved search when configuring a new anomaly detection job:
Figure 7.6 – Leveraging a saved search in an anomaly detection job
As such, our ML job will now only run for the hosts of interest for this specific application. Thus, we have been able to effectively limit and segment the data analysis to the hosts that we've defined as contributing to this application.
Data enrichment on ingest
Another option is to move the decision-making about which hosts belong to which applications further upstream, to the time of ingest. If Logstash is part of the ingest pipeline, you could use a filter plugin to add additional fields to the data based upon a lookup against an asset list (file, database, and so on). Consult the Logstash documentation at https://www.elastic.co/guide/en/logstash/current/lookup-enrichment.html, which shows you how to dynamically enrich the indexed documents with additional fields to provide context. If you are not using Logstash (merely using Beats/Elastic Agent and the ingest node), perhaps a simpler way would be to use the enrich processor instead. Consult the documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest-enriching-data.html. For example, you could have this enrichment add an application_name field and dynamically populate the value of this field with the appropriate name of the application, such as the following (truncated JSON here; a sketch of an ingest pipeline that performs this kind of enrichment follows at the end of this section):
"host": "wasinv2.acme.com",
"application_name": "invoice_processing",
Or you could have the following: ''host'': ''www3.acme.com'', ''application_name'': ''online_purchases'',
Once the value of this field is set and inserted into the indexed documents, then you would use the application_name field, along with the ability to filter the query for the anomaly detection job (as previously described), to limit your data analysis to the pertinent application of interest. The addition of the data enrichment step may seem like a little more up-front effort, but it should pay dividends in the long term as it will be easier to maintain as asset names change or evolve, since the first method requires hardcoding the asset names into the searches of the ML jobs. Now that we have organized our data and perhaps even enriched it, let's now see how we can leverage that contextual information to make our anomaly detection jobs more effective.
Leveraging the contextual information
With our data organized and/or enriched, the two primary ways we can leverage contextual information are via analysis splits and statistical influencers.
Analysis splits
We have already seen that an anomaly detection job can be split based on any categorical field. As such, we can individually model behavior separately for each instance of that field. This could be extremely valuable, especially in a case where each instance needs its own separate model. Take, for example, the case where we have data for different regions of the world:
Figure 7.7 – Differing data behaviors based on region
Whatever data this is (sales KPIs, utilization metrics, and so on), clearly it has very distinctive patterns that are unique to each region. In this case, it makes sense to split any analysis we do with anomaly detection for each region to capitalize on this uniqueness. We would be able to detect anomalies in the behavior that are specific to each region. Let's also imagine that, within each region, a fleet of servers support the application and transaction processing, but they are load-balanced and contribute equally to the performance/operation. In that way, there's nothing unique about each server's contribution to a region. As such, it probably doesn't make sense to split the analysis per server.
We've naturally come to the conclusion that splitting by region is more effective than splitting by server. But what if a particular server within a region is having problems contributing to the anomalies that are being detected? Wouldn't we want to have this information available immediately, instead of having to manually diagnose further? This is possible to know via influencers.
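Putting those two decisions together, the analysis portion of such a job might look roughly like this (the field names region and server are purely illustrative): the detector is split per region with partition_field_name, while server is nominated only as an influencer, a concept we will revisit next:
"analysis_config": {
  "bucket_span": "15m",
  "detectors": [
    {
      "detector_description": "sum(events_per_min) partitioned by region",
      "function": "sum",
      "field_name": "events_per_min",
      "partition_field_name": "region"
    }
  ],
  "influencers": [ "region", "server" ]
}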
Statistical influencers
We introduced the concept of influencers in Chapter 5, Interpreting Results. As a reminder, an influencer is a field that describes an entity where you would like to know whether it "influences" (is to blame for) the existence of the anomaly or at least had a significant contribution. Remember that any field chosen as a candidate to be an influencer doesn't need to be part of the detection logic, although it is natural to pick fields that are used as splits to also be influencers. It is also important that influencers are chosen when the anomaly detection jobs are created, as they cannot be added to the configuration later. It is also key to understand that the process of finding potential influencers happens after the anomaly detection job finds the anomaly. In other words, it does not affect any of the probability calculations that are made as part of the detection. Once the anomaly has been determined, ML will systematically go through all instances of each candidate influencer field and remove that instance's contribution to the data in that time bucket. If, once removed, the remaining data is no longer anomalous, then via counterfactual reasoning, that instance's contribution must have been influential and is scored accordingly (with an influencer_score value in the results). What we will see in the next section, however, is how influencers can be leveraged when viewing the results of not just a single anomaly detection job, but potentially several related jobs. Let's now move on to discuss the process of grouping and viewing jobs together to assist with RCA.
Bringing it all together for RCA
We are now at the point where we can discuss how to bring everything together. In our desire to increase our effectiveness in IT operations and look more holistically at application health, we now need to operationalize what we've prepared in the prior sections and configure our anomaly detection jobs accordingly. To that end, let's work through a real-life scenario in which Elastic ML helped get to the root cause of an operational problem.
Outage background
This scenario is loosely based on a real application outage, although the data has been somewhat simplified and sanitized to obfuscate the original customer. The problem was with a retail application that processed gift card transactions. Occasionally, the app would stop working and transactions could not be processed. This would only be discovered when individual stores called headquarters to complain. The root cause of the issue was unknown and couldn't be ascertained easily by the customer. Because they never got to the root cause, and because the problem could be fixed by simply rebooting the application servers, the problem would randomly reoccur and plagued them for months. The following data was collected and included in the analysis to help understand the origins of the problem (and is supplied in the GitHub repo):
• A summarized (1-minute) count of transaction volume (the main KPI)
• Application logs (semi-structured text-based messages) from the transaction processing engine
• SQL Server performance metrics from the database that backed the transaction processing engine
• Network utilization performance metrics from the network the transaction processing engine operates on
As such, four ML jobs were configured against the data. They were as follows:
• it_ops_kpi: Using sum on the number of transactions processed per minute
• it_ops_logs: Using a count by mlcategory detector to count the number of log messages by type, using dynamic ML-based categorization to delineate different message types
• it_ops_sql: Simple mean analysis of every SQL Server metric in the index
• it_ops_network: Simple mean analysis of every network performance metric in the index
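The exact job definitions are supplied in the GitHub repo, but as an illustration, the analysis portion of the it_ops_logs categorization job is conceptually along these lines (the name of the raw text field is an assumption here, not taken from the actual dataset):
"analysis_config": {
  "bucket_span": "15m",
  "categorization_field_name": "message",
  "detectors": [
    {
      "detector_description": "count by mlcategory",
      "function": "count",
      "by_field_name": "mlcategory"
    }
  ],
  "influencers": [ "mlcategory" ]
}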
These four jobs were configured and run on the data when the problem occurred in the application. Anomalies were found, especially in the KPI that tracked the number of transactions being processed. In fact, this is the same KPI that we saw at the beginning of this chapter, where an unexpected dip in order processing was the main indicator that a problem was occurring:
Figure 7.8 – The KPI of the number of transactions processed
However, the root cause wasn't understood until this KPI's anomaly was correlated with the anomalies in the other three ML jobs that were looking at the data in the underlying technology and infrastructure. Let's see how the power of visual correlation and shared influencers allowed the underlying cause to be discovered.
Correlation and shared influencers
In addition to the anomaly in the transactions processed KPI (in which an unexpected dip occurs), the other three anomaly detection jobs (for the network metrics, the application logs, and the SQL database metrics) were superimposed onto the same time frame in the Anomaly Explorer. The following screenshot shows the results of this:
Figure 7.9 – Anomaly Explorer showing results of multiple jobs
In particular, notice that on the day the KPI was exhibiting problems (February 8, 2021, as shown in Figure 7.8), the three other jobs in Figure 7.9 exhibit correlated anomalies, shown by the circled area. Upon closer inspection (by clicking on the red tile for the it_ops_sql job), you can see that there were issues with several of the SQL Server metrics going haywire at the same time:
Figure 7.10 – Anomaly Explorer showing anomalies for SQL Server
Note
The shaded area of the charts is highlighting the window of time associated with the width of the selected tile in the swim lane. This window of time might be larger than the bucket span of the analysis (as is the case here) and therefore the shaded area can contain many individual anomalies during that time frame.
If we look at the anomalies in the anomaly detection job for the application log, there is an influx of errors all referencing the database (further corroborating an unstable SQL server):
Figure 7.11 – Anomaly Explorer showing anomalies for the application log
However, interesting things were also happening on the network:
Figure 7.12 – Anomaly Explorer showing anomalies for the network data
Specifically, there was a large spike in network traffic (shown by the Out_Octets metric), and a high spike in packets getting dropped at the network interface (shown by the Out_Discards metric). At this point, there was clear suspicion that this network spike might have something to do with the database problem. And, while correlation is not always causation, it was enough of a clue to entice the operations team to look back over some historical data from prior outages. In every other outage, this same pattern of a large network spike and packet drops also existed.
The ultimate cause of the network spike was VMware's action of moving VMs to new ESX servers. Someone had misconfigured the network switch, and VMware was sending this massive burst of traffic over the application VLAN instead of the management VLAN. When this occurred (randomly, of course), the transaction processing app would temporarily lose its connection to the database and attempt to reconnect. However, there was a critical flaw in this reconnection code in that it would not attempt the reconnection to the database at the remote IP address that belonged to SQL Server. Instead, it attempted the reconnection to localhost (IP address 127.0.0.1), where, of course, there was no such database. The clue to this bug was seen in one of the example log lines that Elastic ML displayed in the Examples section (circled in the following screenshot):
Figure 7.13 – Anomaly Explorer showing the root cause of the reconnection problem
Once the problem occurred, the connection to SQL Server was therefore only possible if the application server was completely rebooted, the startup configuration files were reread, and the IP address of SQL Server was relearned. This was why a full reboot always fixed the problem.
One key thing to notice is how the influencers in the user interface also assist with narrowing down the scope of who's at fault for the anomalies:
Figure 7.14 – Anomaly Explorer showing the top influencers
The top-scoring influencers over the time span selected in the dashboard are listed in the Top influencers section on the left. For each influencer, the maximum influencer score (in any bucket) is displayed, together with the total influencer score over the dashboard time range (summed across all buckets). And, if multiple jobs are being displayed together, then those influencers that are common across jobs have higher sums, thus pushing their ranking higher. This is a key point because it becomes very easy to see commonalities in offending entities across jobs. If esxserver1.acme.com is the only physical host that surfaces as an influencer when viewing multiple jobs, then we immediately know which machine to focus on; we know it is not a widespread problem. In the end, the customer was able to remediate the situation by both correcting the network misconfiguration and addressing the bug in the database reconnection code. They were able to narrow in on this root cause quite quickly because Elastic ML allowed them to narrow the focus of their investigation, thus saving time and preventing future occurrences.
Summary
Elastic ML can certainly boost the amount of data that IT organizations pay attention to, and thus get more insight and proactive value out of their data. The ability to organize, correlate, and holistically view related anomalies across data types is critical to problem isolation and root cause identification. It reduces application downtime and limits the possibility of problem recurrence. In the next chapter, we will see how other apps within the Elastic Stack (APM, Security, and Logs) take advantage of Elastic ML to provide an out-of-the-box experience that's custom-tailored for specific use cases.
8
Anomaly Detection in Other Elastic Stack Apps
When the first edition of this book was authored two years ago, there was no concept of other apps within the stack leveraging Elastic ML for domain-specific solutions. However, since then, Elastic ML has become a provider of anomaly detection for domain-specific solutions, providing tailor-made job configurations that users can enable with a single click. In this chapter, we will explore what Elastic ML brings to various Elastic Stack apps:
• Anomaly detection in Elastic APM
• Anomaly detection in the Logs app
• Anomaly detection in the Metrics app
• Anomaly detection in the Uptime app
• Anomaly detection in the Elastic Security app
Technical requirements
The information in this chapter is relevant as of v7.12 of the Elastic Stack.
Anomaly detection in Elastic APM
Elastic APM takes application monitoring and performance management to a whole new level by allowing users to instrument their application code to get deep insights into the performance of individual microservices and transactions. In complex environments, this could generate a large number of measurements and poses a potentially paradoxical situation – one in which greater observability is obtained via this detailed level of measurement while possibly overwhelming the analyst who has to sift through the results for actionable insights. Fortunately, Elastic APM and Elastic ML are a match made in heaven. Anomaly detection not only automatically adapts to the unique performance characteristics of each transaction type via unsupervised machine learning, but it can also scale to handle the possibly voluminous amounts of data that APM can generate. While the user is always free to create anomaly detection jobs against any kind of time-series data in any index, regardless of type, there is a compelling argument to simply provide pre-made, out-of-the-box job configurations on Elastic APM data, since the data format is already known.
Enabling anomaly detection for APM
In order to take advantage of anomaly detection with your APM data, you obviously need to have some APM data collected for some declared services, and to have that data stored in indices accessible via the apm-* index pattern:
1. If you have not yet set up anomaly detection on your APM data, you will see an indicator at the top of the screen letting you know that it still needs to be set up:
Figure 8.1 – An indicator for when anomaly detection is not yet enabled on APM
2. To demonstrate what the configuration of anomaly detection would look like, a trivial Hello World Node.js application was created and instrumented with Elastic APM. This application (called myapp) was also tagged with the environment tag of dev to signify that the application is a development app (this is all done within the APM agent for the Node.js configuration):
Figure 8.2 – Sample Node.js app instrumented with Elastic APM
3. When viewed inside of Elastic APM, the service will look as follows:
Figure 8.3 – Sample Node.js app viewed in the Elastic APM UI
4. To enable anomaly detection on this service, simply go to Settings, click on Anomaly detection, and then click on the Create ML Job button:
Figure 8.4 – Creating ML jobs for APM data
5. Next, specify the environment name that you want to build the job for:
Figure 8.5 – Specifying the environment to build the ML job for
6. Here we will select dev as it is the only choice available for our application. Once selected and after clicking the Create Jobs button, we get confirmation that our job was created, and it is listed in the table:
Figure 8.6 – Listing of created anomaly detection jobs in APM
7. If we go to the Elastic ML app to see the details of the job that was created for us, we will see the following:
Figure 8.7 – Listing of created anomaly detection jobs in ML
8. Inspecting this further, we can see that the actual detector configuration for the job is the following:
Figure 8.8 – Detector configuration for the APM job
Notice that this leverages the high_mean function on the transaction.duration.us field, split on transaction.type and partitioned on service.name. Also notice that the bucket_span value of the job is 15 minutes, which may or may not be the ideal setting for your environment.
Note
It should be noted that this configuration is the only one possible when using this one-click approach from the APM UI, as the configuration is hardcoded. If you want to customize or create your own configurations, you could either create them from scratch or possibly clone this job and set your own bucket span and/or detector logic.
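For those who prefer configuration to screenshots, the detector described above corresponds to JSON roughly like the following (reconstructed from the description; the hardcoded job created by the APM UI may differ in its surrounding settings):
{
  "detector_description": "high transaction duration by transaction type per service",
  "function": "high_mean",
  "field_name": "transaction.duration.us",
  "by_field_name": "transaction.type",
  "partition_field_name": "service.name"
}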
9. From the perspective of the data that is being queried, we can click on the Datafeed tab to see the settings as follows:
Figure 8.9 – Datafeed configuration for the APM job
Notice that we are only querying for documents that match "service.environment" : "dev", as we indicated when setting up the job in the APM UI (a sketch of this query appears after Figure 8.10). 10. We can click on the Datafeed preview tab to see a sample of the observations that will be fed to Elastic ML for this particular job, as shown in the following figure:
Figure 8.10 – Datafeed preview for the APM job
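As noted in step 9, the datafeed restricts the documents it retrieves to the chosen environment. Conceptually, its query boils down to a term filter along these lines (a simplified sketch; the exact query generated by the APM UI may be structured differently):
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "service.environment": "dev" } }
      ]
    }
  }
}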
Now that the APM job is configured and running, let's turn our focus to where within the APM UI the anomaly detection job's results will be reflected.
Viewing the anomaly detection job results in the APM UI
In addition to viewing the job results in the Elastic ML UI, there are three key places in which the results of anomaly detection jobs are manifested in the APM UI:
• Services overview: This view in the APM UI gives a high-level health indicator that corresponds to the max anomaly score of the anomaly detection results for that particular service:
Figure 8.11 – Services overview in the APM UI
• Service map: This dynamic view shows application transaction dependencies, with a color-coded anomaly indicator based on the max anomaly score of the anomaly detection results for that particular service:
Anomaly detection in Elastic APM
Figure 8.12 – Service map in the APM UI
• Transaction duration chart: This chart, under the main Transactions view in APM, shows the expected bounds and a colored annotation when the anomaly score is over 75 for a particular transaction type:
Figure 8.13 – Transaction duration chart in the APM UI
These helpful indicators of anomalousness guide the user to investigate further and troubleshoot problems using the deep capabilities of the APM and ML UIs. Let's see one more way to integrate Elastic ML with data from APM, using the data recognizer.
Creating ML Jobs via the data recognizer
While "the data recognizer" isn't an official marketing name of an actual feature in Elastic ML, it can in fact be very helpful in assisting the user to create pre-configured jobs for data that is recognized.
Note
Pre-configured jobs for the data recognizer are defined and stored in the following GitHub repository: https://github.com/elastic/kibana/tree/master/x-pack/plugins/ml/server/models/data_recognizer/modules.
Essentially, the workflow for creating a new job using the recognizer is as follows: if, during the process of creating an anomaly detection job, the data selected for the job's input (index pattern or saved search) matches search patterns known to one of the pre-defined recognizer modules, then the user is offered the ability to create one of those pre-defined jobs. We can see an example of this using the trivial Node.js example from earlier in this chapter. Instead of creating the anomaly detection job from the APM UI, we can instead go through the Elastic ML UI and select the apm-* index pattern:
• By selecting the index pattern, we will see two preconfigured jobs offered to us in addition to the normal job wizards:
Figure 8.14 – Jobs offered from the data recognizer
• The first "APM" job is the same type of job we already created earlier in this chapter from the APM UI. The second option (APM: Node.js) is actually a collection of three jobs:
Figure 8.15 – Node.js jobs offered from the data recognizer
• The third of these is, yet again, the same type of job we already created earlier in this chapter from the APM UI, but the other two are unique. This notion of offering the user suggested jobs if the source data is "recognized" is not unique to APM data, and you may see those suggested jobs in other situations or use cases (such as choosing the indices of the sample Kibana data, data from nginx, and so on). Now that we've seen how Elastic ML is embedded into the APM app, let's keep exploring to find out how anomaly detection is leveraged by the Logs app.
Anomaly detection in the Logs app
The Logs app inside of the Observability section of Kibana offers a similar view of your data as the Discover app. However, users who appreciate more of a live tail view of their logs, regardless of the index in which the data is stored, will love the Logs app:
Figure 8.16 – The Logs app, part of the Observability section of Kibana
Notice that there is both an Anomalies tab and a Categories tab. Let's first discuss the Categories section.
Log categories
Elastic ML's categorization capabilities, first shown back in Chapter 3, Anomaly Detection, are applied in a generic way to any index of unstructured log data. Within the Logs app, however, categorization is employed with some stricter constraints on the data. In short, the data is expected to be in Elastic Common Schema (ECS) with certain fields defined (especially a field called event.dataset).
Note The logs dataset from Chapter 7, AIOps and Root Cause Analysis, is duplicated for this chapter in the GitHub repository for use with the Logs app, with the addition of the event.dataset field. If you're importing via the file upload facility in ML, be sure to override the field name to be event.dataset instead of the default event_dataset that will be offered.
One can imagine the reason behind this constraint, given that the Logs app is trying to create the categorization job for you in a templated way. Therefore, it needs to be certain of the naming convention of the fields. This would obviously not be the case if this were to be done in the ML app, where the onus is on the user to declare the names of the categorization field and the message field. If you do, however, configure the Logs app to invoke categorization, then the output of this will look something like the following figure, which shows each distinct log category, sorted by the maximum anomaly score:
Figure 8.17 – The Logs app displaying categorization results from Elastic ML
Users can then click the Analyze in ML button to navigate over to the Anomaly Explorer in the ML UI for further inspection.
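For reference, the categorization job that the Logs app creates behind the scenes is built around a detector broadly like the following (a hedged sketch of the general shape only, not the exact configuration generated by the Logs app):
{
  "analysis_config": {
    "categorization_field_name": "message",
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "event.dataset"
      }
    ]
  }
}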
Log anomalies
The Anomalies section of the Logs app provides a view similar to that of the Anomaly Explorer:
Figure 8.18 – The Logs app displaying a view similar to ML's Anomaly Explorer
It also allows the user to manage the anomaly detection jobs if the Manage ML jobs button is clicked:
Figure 8.19 – The Logs app allowing the management of ML jobs
The Log rate job displayed on the left-hand side of Figure 8.18 is just a simple count detector, partitioned on the event.dataset field. Users should note that these ML jobs can be recreated here, but not permanently deleted – you must go to the anomaly detection job management page in the ML app to delete the jobs. Obviously, an expert user might just resort to creating and managing their anomaly detection jobs for their logs within the ML app. But it is nice that the Logs app does surface Elastic ML's capabilities in a way that makes the functionality obvious and easy to implement. Let's continue this trend to see how Elastic ML is utilized in the Metrics app.
Anomaly detection in the Metrics app
The Metrics app, also part of the Observability section of Kibana, allows users to have an inventory and metrics-driven view of their data:
• In the Inventory view, users see an overall map of monitored resources. Entities such as hosts, pods, or containers can be organized and filtered to customize the view – including a color-coded health scale as shown in the following figure:
Figure 8.20 – The Metrics app showing the Inventory view
Notice that the bottom panel would display anomalies, if detected. This is currently the only place to view anomalies within the Metrics app. • To enable the built-in anomaly detection jobs via the Metrics app, click the Anomaly detection button at the top to enable the configuration flyout:
Figure 8.21 – The Metrics app showing the management of anomaly detection jobs
• If, for example, we choose to enable anomaly detection for hosts, clicking the appropriate Enable button will show this:
Figure 8.22 – The Metrics app showing the config of anomaly detection on hosts
• Notice that a partition field is offered as part of the configuration, if desired. When you click the Enable jobs button, three different anomaly detection jobs are created on your behalf, viewable in Elastic ML's Job Management section:
Figure 8.23 – The anomaly detection jobs for hosts created by the Metrics app
• To set the minimum anomaly score required to have the Metrics app display anomalies for you, head on over to the Settings tab and configure the Anomaly Severity Threshold setting:
Figure 8.24 – The Anomaly Severity Threshold setting within the Metrics app
The Elastic ML integration with the Metrics app is quite simple and straightforward – it allows the user to quickly get started using anomaly detection on their metric data. Let's now take a quick look at the integration in the Uptime app.
Anomaly detection in the Uptime app
The Uptime app allows simple availability and response time monitoring of services via a variety of network protocols, including HTTP/S, TCP, and ICMP:
1. Often classified as synthetic monitoring, the Uptime app uses Heartbeat to actively probe network endpoints from one or more locations:
Figure 8.25 – The Uptime app in Kibana
2. If you would like to enable anomaly detection on a monitor, simply click on the monitor name to see the monitor detail. Within the Monitor duration panel, notice the Enable anomaly detection button:
Figure 8.26 – Enabling anomaly detection for an Uptime monitor
3. Clicking on the Enable anomaly detection button creates the job in the background and offers the user the option to create an alert for anomalies surfaced by the job:
Figure 8.27 – Creating an alert on the anomaly detection job in the Uptime app
4. Once the anomaly detection job is available, any anomalies discovered will also be displayed within the monitor's detail page in the Monitor duration panel:
Figure 8.28 – Anomalies displayed in the Uptime app
Again, the integration of Elastic ML with another one of the Observability applications in the stack makes it incredibly easy for users to take advantage of sophisticated anomaly detection. But we also know that Elastic ML can do some interesting things with respect to population and rare analysis. The integration of ML with the Elastic SIEM is in store for you in the next section – let's get detecting!
Anomaly detection in the Elastic Security app
Elastic Security is truly the quintessence of a purpose-driven application in the Elastic Stack. Created from the ground up with the security analyst's workflow in mind, the comprehensiveness of the Elastic Security app could fill an entire book on its own. However, the heart of the Elastic Security app is the Detections feature, in which user- and Elastic-created rules execute to create alerts when the rules' conditions are met. As we'll see, Elastic ML plays a significant role in the Detections feature.
Prebuilt anomaly detection jobs
The majority of the detection rules in Elastic Security are static, but many are backed by prebuilt anomaly detection jobs that operate on the data collected from Elastic Agent or Beats, or equivalent data that conforms with the ECS fields that are applicable for each job type. To see a comprehensive list of the anomaly detection jobs supplied by Elastic, view the datafeed and job configuration definitions in the security_* and siem_* folders in the following GitHub repository: https://github.com/elastic/kibana/tree/7.12/x-pack/plugins/ml/server/models/data_recognizer/modules (you can substitute the latest release version number for 7.12 in the URL). An astute reader will notice that many of the prebuilt jobs leverage either population analysis or the rare detector. Each of these styles of anomaly detection is well aligned with the goals of the security analyst – where finding novel behaviors and/or behaviors that make users or entities different from the crowd is often linked to an indicator of compromise. The prebuilt anomaly detection jobs are viewable in the Detections tab of Elastic Security and have the ML tag:
Figure 8.29 – ML jobs in the Detection rules section of the Security app
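To give a flavor of what these prebuilt jobs contain, a rare-style detector of the kind mentioned above looks roughly like the following (an illustrative sketch in the spirit of the shipped jobs, not a copy of any specific one; the ECS field names here are assumptions):
{
  "function": "rare",
  "by_field_name": "process.name",
  "partition_field_name": "host.name"
}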
Clicking on ML job settings at the top right of the screen will expose the settings list, where the user can see all of the jobs in the library – even the ones that are not available (marked with a warning icon):
Figure 8.30 – All jobs in the ML job settings section of the Security app
Jobs are marked as unavailable if the necessary data for the job to use is not currently indexed in Elasticsearch. If the job is available, you can activate the job by clicking the toggle switch and the anomaly detection job will be provisioned in the background. Of course, you can always view the jobs created in the Elastic ML Job Management UI:
Figure 8.31 – Elastic ML jobs created by Security as a view in the Job Management UI
Now that we know how to enable anomaly detection jobs, it is time to see how they create detection alerts for the Security app.
Anomaly detection jobs as detection alerts
As we return to the Detections view shown back in Figure 8.29, if we click on the Create new rule button, we can see that we can select Machine Learning as the rule type:
Figure 8.32 – Creating an ML-based detection rule
If you were to select a specific machine learning job using the drop-down list, you would also be asked to select the Anomaly score threshold value that needs to be exceeded to trigger the detection alert:
Figure 8.33 – Selecting the ML job and score threshold for the detection rule
Obviously, if the job is not currently running, a warning will indicate that the job needs to be started. It is also notable that if you created your own custom job (that is, not using one of the prebuilt jobs) but assigned it to a Job Group called "security" in the Elastic ML app, that custom job would also be a candidate to be chosen within the drop-down box in this view (a minimal example of assigning a custom job to this group appears below). The remainder of the detection rule configuration can be left to the reader – as the rest is not specific to machine learning. The Elastic Security app clearly relies heavily on anomaly detection to augment traditional static rules with dynamic user/entity behavioral analysis that can surface notable events that are often worthy of investigation. It will be interesting to see how far and how broadly the machine learning capabilities continue to bolster this emerging use case as time goes on!
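As a minimal sketch of that last point, a custom job can be placed in the security group when it is created through the API simply by listing the group in its groups array (the job ID, detector, and field names below are hypothetical):
PUT _ml/anomaly_detectors/my-custom-security-job
{
  "groups": [ "security" ],
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_count",
        "over_field_name": "user.name",
        "partition_field_name": "host.name"
      }
    ],
    "influencers": [ "user.name", "host.name" ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}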
Summary
Elastic ML has clearly infiltrated many of the other apps in the Elastic Stack, bringing easy-to-use functionality to users' fingertips. This shows how much Elastic ML really is a core capability of the Stack itself, akin to other key stack features such as aggregations. Congratulations, you have reached the end of the first half of this book, and hopefully you feel well armed with everything that you need to know about Elastic ML's anomaly detection. Now, we will venture into the "other side" of Elastic ML – data frame analytics – where you will learn how to bring other machine learning techniques (including supervised model creation and inference) to open up analytical solutions to a vast new array of use cases.
Section 3 – Data Frame Analysis
This section covers the basics of data frame analytics, what it is useful for, and how you will be able to implement it. It covers various types of analysis and how they differ from each other. At the end, the section covers some useful tidbits that can be used to get the most out of Elastic ML. This section covers the following topics:
• Chapter 9, Introducing Data Frame Analytics
• Chapter 10, Outlier Detection
• Chapter 11, Classification Analysis
• Chapter 12, Regression
• Chapter 13, Inference
• Appendix: Anomaly Detection Tips
9
Introducing Data Frame Analytics
In the first section of this book, we took an in-depth tour of anomaly detection, the first machine learning capability to be directly integrated into the Elastic Stack. In this chapter and the following ones, we will take a dive into the new machine learning features integrated into the stack. These include outlier detection, a novel unsupervised learning technique for detecting unusual data points in non-time series indices, as well as two supervised learning features, classification and regression. Supervised learning algorithms use labeled datasets – for example, a dataset describing various aspects of tissue samples along with whether or not the tissue is malignant – to learn a model. This model can then be used to make predictions on previously unseen data points (or tissue samples, to continue our example). When the target of prediction is a discrete variable or a category, such as a malignant or non-malignant tissue sample, the supervised learning technique is called classification. When the target is a continuous numeric variable, such as the sale price of an apartment or the hourly price of electricity, the supervised learning technique is known as regression. Collectively, these three new machine learning features are known as Data Frame Analytics. We will discuss each of these in more depth in the following chapters.
Although each of these solves a different problem and has a different purpose, they are all powered under the hood by a common data transformation technology, that of transforms, which enables us to transform and aggregate data from a transaction- or stream-based format into an entity-based format. This entity-centric format is required by many of the algorithms we use in Data Frame Analytics and thus, before we dive deeper into each of the new machine learning features, we are going to dedicate this chapter to understanding in depth how to use transforms to transform our data into a format that is more amenable for downstream machine learning technologies. While on this journey, we are also going to take a brief tour of Painless, the scripting language embedded into Elasticsearch, which is a good tool for any data scientists or engineers working with machine learning in the Elastic Stack. A rich ecosystem of libraries, both for data manipulation and machine learning, exists outside of the Elastic Stack as well. One of the main drivers powering these applications is Python. Because of its ubiquity in the data science and data engineering communities, we are going to focus, in the second part of this chapter, on using Python with the Elastic Stack, with a particular focus on the new data science-native Elasticsearch client, Eland. We'll check out the following topics in this chapter:
• Learning how to use transforms
• Using Painless for advanced transform configurations
• Working with Python and Elasticsearch
Technical requirements
The material in this chapter requires Elasticsearch version 7.9 or above and Python 3.7 or above. Code samples and snippets required for this chapter will be added under the folder Chapter 9 - Introduction to Data Frame Analytics in the book's GitHub repository (https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition/tree/main/Chapter%209%20-%20Introduction%20to%20Data%20Frame%20Analytics). In cases where an example requires a specific newer release of Elasticsearch, this will be mentioned before the example is presented.
Learning how to use transforms
In this section, we are going to dive right into the world of transforming stream- or event-based data, such as logs, into an entity-centric index.
Why are transforms useful?
Think about the most common data types that are ingested into Elasticsearch. These will often be documents recording some kind of time-based or sequential event, for example, logs from a web server, customer purchases from a web store, comments published on a social media platform, and so forth. While this kind of data is useful for understanding the behavior of our systems over time and is perfect for use with technologies such as anomaly detection, it is harder to make stream- or event-based datasets work with Data Frame Analytics features without first aggregating or transforming them in some way. For example, consider an e-commerce store that records purchases made by customers. Over a year, there may be tens or hundreds of transactions for each customer. If the e-commerce store then wants to find a way to use outlier detection to detect unusual customers, they will have to transform all of the transaction data points for each customer and summarize certain key metrics such as the average amount of money spent per purchase or the number of purchases per calendar month.
Figure 9.1 – A diagram illustrating the process of taking e-commerce transactions and transforming them into an entity-centric index
In order to perform the transformation depicted in Figure 9.1, we have to group each of the documents in the transaction index by the name of the customer and then perform two computations: sum up the quantity of items in each transaction document to get a total sum and also compute the average price of purchases for each customer. Doing this manually for all of the transactions for each of the thousands of potential customers would be extremely arduous, which is where transforms come in.
The anatomy of a transform
Although we are going to start off our journey into transforms with simple examples, many real-life use cases can very quickly get complicated. It is useful to keep in mind two things that will help you keep your bearings as you apply transforms to your own data projects: the pivot and the aggregations. Let's examine how these two entities complement each other to help us transform a stream-based document into an entity-centric index. In our customer analytics use case, we have many different features describing each customer: the name of the customer, the total price they paid for each of their products at checkout, the list of items they purchased, the date of the purchase, the location of the customer, and so forth. The first thing we want to pick is the entity for which we are going to construct our entity-centric index. Let's start with a very simple example and say that our goal is to find out how much each customer spent on average per purchase during our time period and how much they spent in total. Thus, the entity we want to construct the index for – our pivot – is the name of the customer. Most of the customers in our source index have more than one transaction associated with them. Therefore, if we try to group our index by customer name, for each customer we will have multiple documents. In order to pivot successfully using this entity, we need to decide which aggregate quantities (for example, the average price per order paid by the customer) we want to bring into our entity-centric index. This will, in turn, determine which aggregations we will define in our transform configuration. Let's take a look at how this works out with a practical example.
Using transforms to analyze e-commerce orders
In this section, we will use the Kibana E-Commerce sample dataset to illustrate some of the basic transformation concepts outlined in the preceding section:
1. Import the Sample eCommerce orders dataset from the Kibana Sample data panel displayed in Figure 9.2 by clicking the Add data button. This will create a new index called kibana_sample_data_ecommerce and populate it with the dataset.
Figure 9.2 – Import the Sample eCommerce orders dataset from the Kibana Sample data panel
2. Navigate to the Transforms wizard by bringing up the Kibana slide-out panel menu from the hamburger button in the top left-hand corner, navigating to Stack Management, and then clicking Transforms under the Data menu. 3. In the Transforms view, click Create your first transform to bring up the Transforms wizard. This will prompt you to choose a source index – this is the index that the transform will use to create your pivot and aggregations. In our case, we are interested in the kibana_sample_data_ecommerce index, which you should select in the panel shown in Figure 9.3. The source indices displayed in your Kibana may look a bit different depending on the indices currently available in your Elasticsearch cluster.
Figure 9.3 – For this tutorial, please select kibana_sample_data_ecommerce
4. After selecting our source index, the Transform wizard will open a dialog that shows us a preview of our source data (Figure 9.4), as well as allowing us to select our pivot entity using the drop-down selector under Group by. In this case, we want to pivot on the field named customer_full_name.
Figure 9.4 – Select the entity you want to pivot your source index by in the Group by menu
5. Now that we have defined the entity to pivot our index by, we will move on to the next part in the construction of a transform: the aggregations. In this case, we are interested in figuring out the average amount of money the customer spent in the e-commerce store per order. During each transaction, which is recorded in a document in the source index, the total amount paid by the customer is stored in the field taxful_total_price. Therefore, the aggregation that we define will operate on this field. In the Aggregations menu, select taxful_total_price.avg. Once you have clicked on this selection, the field will appear in the box under Aggregations and you will see a preview of the pivoted index as shown in Figure 9.5.
Figure 9.5 – A preview of the transformed data is displayed to allow a quick check that everything is configured as desired.
6. Finally, we will configure the last two items: an ID for the transform job, and the name of the destination index that will contain the documents that describe our pivoted entities. It is a good idea to leave the Create index pattern checkbox checked as shown in Figure 9.6 so that you can easily navigate to the destination index in the Discover tab to view the results.
Figure 9.6 – Each transform needs a transform ID
The transform ID is used to identify the transform job, and the destination index will contain the documents of the entity-centric index that is produced as a result of the transform job. 7. To start the transform job, remember to click Next in the Transform wizard after completing the instructions described in step 6, followed by Create and start. This will launch the transform job and create the pivoted, entity-centric index. 8. After the transform has completed (you will see the progress bar reach 100% if all goes well), you can click on the Discover button at the bottom of the Transform wizard and view your transformed documents. As we discussed at the beginning of this section, we see from a sample document in Figure 9.7 that the transform job has taken a transaction-centric index, which recorded each purchase made by a customer in our e-commerce store, and transformed it into an entity-centric index that describes a specific analytical transformation (the calculation of the average price paid by the customer) grouped by the customer's full name.
Figure 9.7 – The result of the transform job is a destination index where each document describes the aggregation per each pivoted entity. In this case, the average taxful_total_price paid by each customer
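For readers who prefer working with the API directly, the transform built in the wizard corresponds roughly to the following Dev Console request (a sketch; the transform ID, destination index name, and the exact keyword subfield used for the group by are illustrative and should be checked against your index mapping):
PUT _transform/ecommerce-customer-spend
{
  "source": { "index": "kibana_sample_data_ecommerce" },
  "pivot": {
    "group_by": {
      "customer_full_name": {
        "terms": { "field": "customer_full_name.keyword" }
      }
    },
    "aggregations": {
      "taxful_total_price.avg": {
        "avg": { "field": "taxful_total_price" }
      }
    }
  },
  "dest": { "index": "ecommerce-customer-spend" }
}
Clicking Create and start in the wizard is then roughly equivalent to issuing POST _transform/ecommerce-customer-spend/_start.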
Congratulations – you have now created and started your first transform job! Although it was fairly simple in nature, this basic job configuration is a good building block to use for more complicated transformations, which we will take a look at in the following sections.
Exploring more advanced pivot and aggregation configurations
In the previous section, we explored the two parts of a transform: the pivot and the aggregations. In the subsequent example, our goal was to use transforms on the Kibana sample eCommerce dataset to find out the average amount of money our customers spent per order. To solve this problem, we figured out that each document that recorded a transaction had a field called customer_full_name and we used this field to pivot our source index. Our aggregation was an average of the field that recorded the total amount of money spent by the customer on the order. However, not all questions that we might want to ask of our e-commerce data lend themselves to simple pivot or group by configurations like the one discussed previously. Let's explore some more advanced pivot configurations that are possible with transforms, with the help of some sample investigations we might want to carry out on the e-commerce dataset. If you want to discover all of the available pivot configurations, take a look at the API documentation for the pivot object at this URL: https://www.elastic.co/guide/en/elasticsearch/reference/master/put-transform.html. Suppose that we would like to find out the average amount of money spent per order per week in our dataset, and how many unique customers made purchases. In order to answer these questions, we will need to construct a new transform configuration:
1. Instead of pivoting by the name of the customer, we want to construct a date histogram from the field order_date, which, as the name suggests, records when the order was placed. The Transform wizard makes this simple since date_histogram(order_date) will be one of the pre-configured options displayed in the Group by dropdown.
2. Once you have selected date_histogram(order_date) in the Group by dropdown, direct your attention to the right-hand side of the panel as shown in Figure 9.8. The right-hand side should contain an abbreviation for the length of the grouping interval used in the date histogram (for example 1m for an interval of 1 minute). In our case, we are interested in pivoting our index by weeks, so we need to choose 1w from the dropdown.
Figure 9.8 – Adjust the frequency of the date histogram from the dropdown
3. Next, for our aggregation, let's choose the familiar avg(taxful_total_price). After we have made our selection, the Transform wizard will display a preview, which will show the average price paid by a customer per order, grouped by different weeks for a few sample rows. The purpose of the preview is to act as a checkpoint. Since transform jobs can be resource-intensive, at this stage it is good to pause and examine the preview to make sure the data is transformed into a format that you desire.
4. Sometimes we might want to interrogate our data in ways that do not lend themselves to simple one-tiered group-by configurations like the one we explored in the preceding steps. It is possible to nest group-by configurations, as we will see in just a moment. Suppose that in our hypothetical e-commerce store example, we would also be interested in seeing the average amount of money spent by week and by geographic region. To solve this, let's go back to the Transform wizard and add a second group-by field. In this case, we want to group by geoip.region_name. As before, the wizard shows us a preview of the transform once we select the group-by field. As in the previous case, it is good to take a moment to look at the rows displayed in the preview to make sure the data has been transformed in the desired way. Tip Click on the Columns toggle above the transform preview table to rearrange the order of the columns.
In addition to creating multiple group-by configurations, we can also add multiple aggregations to our transform. Suppose that in addition to the average amount of money spent per customer per week and per region, we would also be interested in finding out the number of unique customers who placed orders in our store. Let's see how we can add this aggregation to our transform.
5. In the Aggregations drop-down menu in the wizard, scroll down until you find the entry cardinality(customer_full_name.keyword) and click on it to select it. The resulting aggregation will be added to your transform configuration and the preview should now display one additional column. You can now follow the steps outlined in the tutorial of the previous section to assign an ID and a destination index for the transform, as well as to create and start the job. These will be left as exercises for you (a sketch of the resulting pivot configuration appears at the end of this section).
In the previous two sections, we examined the two key components of transforms: the pivot and the aggregations, and did two different walk-throughs to show how both simple and advanced pivot and aggregation combinations can be used to interrogate our data for various insights. While following the first transform, you may have noticed that in Figure 9.6, we left the Continuous mode checkbox unchecked. We will take a deeper look at what it means to run a transform in continuous mode in the next section.
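Before doing so, and as promised above, here is roughly what the pivot assembled in steps 1 to 5 corresponds to in the transform API (a sketch only; the aggregation names and keyword subfields are illustrative and should be checked against the index mapping):
"pivot": {
  "group_by": {
    "order_date": {
      "date_histogram": { "field": "order_date", "calendar_interval": "1w" }
    },
    "geoip.region_name": {
      "terms": { "field": "geoip.region_name" }
    }
  },
  "aggregations": {
    "taxful_total_price.avg": {
      "avg": { "field": "taxful_total_price" }
    },
    "customer_full_name.cardinality": {
      "cardinality": { "field": "customer_full_name.keyword" }
    }
  }
}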
Discovering the difference between batch and continuous transforms
The first transform we created in the previous section was simple and ran only once. The transform job read the source index kibana_sample_data_ecommerce, which we configured in the Transform wizard, performed the numerical calculations required to compute the average price paid by each customer, and then wrote the resulting documents into a destination index. Because our transform runs only once, any changes to our source index kibana_sample_data_ecommerce that occur after the transform job runs will not be reflected in the data in the destination index. This kind of transform that runs only once is known as a batch transform. In many real-world use cases that produce records of transactions (like in our fictitious e-commerce store example), new documents are constantly being added to the source index. This means that the pivoted entity-centric index that we obtained as a result of running the transform job would be almost immediately out of date. One solution to keep the destination index in sync with the source index is to keep deleting the destination index and rerunning the batch transform job at regular intervals. This, however, is not practical and requires a lot of manual effort. This is where continuous transforms step in. If we have a source index that is being updated and we want to use it to create a pivoted entity-centric index, then we have to use a continuous transform instead of a batch transform. Let's explore continuous transforms in a bit more detail to understand how they differ from batch transforms and what important configuration parameters should be considered when running a continuous transform. First, let's set the stage for the problem we are trying to solve. Suppose we have a fictitious microblogging social media platform, where users post short updates, assign categories to the updates, and interact with other users as well as predefined topics. It is possible to share a post and like a post. Statistics for each post are recorded as well. We have written some Python code to help generate this dataset. This code and accompanying instructions for how to run it are available under the Chapter 9 - Introduction to Data Frame Analytics folder in the GitHub repository for this book (https://github.com/PacktPublishing/Machine-Learning-with-Elastic-Stack-Second-Edition/tree/main/Chapter%209%20-%20Introduction%20to%20Data%20Frame%20Analytics). After running the generator, you will have an index called social-media-feed that will contain a number of documents.
Each document in the dataset records a post that the user has made on the social media platform. For the sake of brevity, we have excluded the text of the post from the document. Figure 9.9 shows a sample document in the social-media-feed index.
Figure 9.9 – A sample document in the social-media-feed index records the username, the time the post was submitted to the platform, as well as some basic statistics about the engagement the post received
In the next section, we will see how to use this fictional social media platform dataset to learn about continuous transforms.
Analyzing social media feeds using continuous transforms
In this section, we will be using the dataset introduced previously to explore the concept of continuous transforms. As we discussed in the previous section, batch transforms are useful for one-off analyses where we are either happy to analyze a snapshot of the dataset at a particular point in time or we do not have a dataset that is changing. In most real-world applications, this is not the case. Log files are continuously ingested, many social media platforms have around-the-clock activity, and e-commerce platforms serve customers across all time zones and thus generate a stream of transaction data. This is where continuous transforms step in.
Let's see how we can analyze the average level of engagement (likes and shares) received by a social media user using continuous transforms: 1. Navigate to the Transforms wizard. On the Stack Management page, look to the left under the Data section and select Transforms. 2. Just as we did in the previous sections, let's start by creating the transform. For the source index, select the social-media-feed index pattern. This should give you a view similar to the one in Figure 9.10.
Figure 9.10 – The Transforms wizard shows a sample of the social-media-feed index
3. In this case, we will be interested in computing aggregations of the engagement metrics of each post per username. Therefore, our Group by configuration will include the username, while our aggregations will compute the total likes and shares per user, the average likes and shares per user as well as the total number of posts each user has made. The final Group by and Aggregations configurations should look something like Figure 9.11.
Figure 9.11 – Group by and Aggregations configuration for our continuous transform
4. Finally, tick the Continuous mode selector and confirm that Date field is selected correctly as timestamp as shown in Figure 9.12.
Figure 9.12 – Select Continuous mode to make sure the transform process periodically checks the source index and incorporates new documents into the destination index
5. Once you click Create and start, you can return to the Transforms page and you will see the continuous transforms job for the social-media-feed index running. Note the continuous tag in the job description.
Figure 9.13 – Continuous transforms shown on the Transforms page. Note that the mode is tagged as continuous
6. Let's insert some new posts into our index social-media-feed and see how the statistics for the user Carl change after a new document is added to the source index for the transform. To insert a new post, open the Kibana Dev Console and run the following REST API command (see chapter9 in the book's GitHub repository for a version that you can easily copy and paste into your own Kibana Dev Console if you are following along):
POST social-media-feed/_doc
{
  "username": "Carl",
  "statistics": {
    "likes": 320,
    "shares": 8000
  },
  "timestamp": "2021-01-18T23:19:06"
}
7. Now that we have added a new document into the source index social-media-feed, we expect that this document will be picked up by the continuous transform job and incorporated into our transform's destination index, social-media-feed-engagement. Figure 9.14 showcases the transformed entry for the username Carl.
Figure 9.14 – The destination index of the continuous transform job holds an entry for the new username Carl, which we added manually through the Kibana Dev Console
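For completeness, an equivalent continuous transform could also be created directly through the API along the following lines (a sketch; the keyword subfield, sync delay, and frequency values are assumptions and should be adjusted for your data):
PUT _transform/social-media-feed-engagement
{
  "source": { "index": "social-media-feed" },
  "pivot": {
    "group_by": {
      "username": { "terms": { "field": "username.keyword" } }
    },
    "aggregations": {
      "statistics.likes.avg": { "avg": { "field": "statistics.likes" } },
      "statistics.likes.sum": { "sum": { "field": "statistics.likes" } },
      "statistics.shares.avg": { "avg": { "field": "statistics.shares" } },
      "statistics.shares.sum": { "sum": { "field": "statistics.shares" } },
      "post_count": { "value_count": { "field": "timestamp" } }
    }
  },
  "sync": { "time": { "field": "timestamp", "delay": "60s" } },
  "frequency": "1m",
  "dest": { "index": "social-media-feed-engagement" }
}
The sync section is what makes the transform continuous: it tells the transform which time field to check when looking for new documents in the source index.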
The preceding example gives a very simplified walk-through of how continuous transforms work and how you can create your own continuous transform using the Transforms wizard available in Kibana. In Chapter 13, Inference, we will return to the topic of continuous transforms when we showcase how to combine trained machine learning models, inference, and transforms. For now, we will take a brief detour into the world of the scripting language Painless. While the Transforms wizard and the many pre-built Group by and Aggregations configurations that it offers suffice for many of the most common data analysis use cases, more advanced users will wish to define their own aggregations. A common way to do this is with the aid of the Elasticsearch embedded scripting language, Painless. In the next section, we will take a little tour of the Painless world, which will prepare you for creating your own advanced transform configurations.
Using Painless for advanced transform configurations
As we have seen in many of the previous sections, the built-in pivot and aggregation options allow us to analyze and interrogate our data in various ways. However, for more custom or advanced use cases, the built-in functions may not be flexible enough. For these use cases, we will need to write custom pivot and aggregation configurations. The flexible scripting language that is built into Elasticsearch, Painless, allows us to do this. In this section, we will introduce Painless, illustrate some tools that are useful when working with Painless, and then show how Painless can be applied to create custom transform configurations.
Introducing Painless
Painless is a scripting language that is built into Elasticsearch. We will take a look at Painless in terms of variables, control flow constructs, operations, and functions. These are the basic building blocks that will help you develop your own custom scripts to use with transforms. Without further ado, let's dive into the introduction. It is likely that many readers of this book come from some sort of programming background. You may have written data cleaning scripts with Python, programmed a Linux machine with bash scripts, or developed enterprise software with Java. Although these languages have many differences and are useful for different purposes, they all share building blocks that help human readers of the language understand them. Although there is an almost infinite number of approaches to teaching a programming language, the approach we will take here will be based on understanding the following basic topics about Painless: variables, operations (such as addition, subtraction, and various Boolean tests), control flow (if-else constructs and for loops), and functions. These are analogous concepts that users familiar with another programming language should be able to relate to. In addition to these concepts, we will be looking at some aspects that are particular to Painless, such as different execution contexts. When learning a new programming language, it is important to have a playground that can be used to experiment with syntax. Luckily, as of version 7.10 of Elasticsearch, the Dev Tools app contains the new Painless Lab playground, where you can try out the code samples presented in this chapter as well as any code samples you write on your own.
The Painless Lab can be accessed by navigating to Dev Tools as shown in Figure 9.15 and then, in the top menu of the Dev Tools page, selecting Painless Lab.
Figure 9.15 – The link to the Dev Tools page is located in the lower section of the Kibana side menu. Select it to access the interactive Painless lab environment
This will open an embedded Painless code editor as shown in Figure 9.16.
Figure 9.16 – The Painless Lab in Dev Tools features an embedded code editor. The Output window shows the evaluation result of the code in the code editor
The code editor in the Painless Lab is preconfigured with some sample functions and variable declarations to illustrate how one might draw the figure in the Output window in Figure 9.16 using Painless. For now, you can delete this code to make space for your own experiments that you will carry out as you read the rest of this chapter.
Tip
The full Painless language specification is available online here: https://www.elastic.co/guide/en/elasticsearch/painless/master/painless-lang-spec.html. You can use it as a reference and resource for further information about the topics covered later.
Variables, operators, and control flow
One of the first things we usually want to do in a programming language is to manipulate values. In order to do this effectively, we assign those values names or variables. Painless has types and before a variable can be assigned, it must be declared along with its type. The syntax for declaring a variable is as follows: type_identifier variable_name;.
How to use this syntax in practice is illustrated in the following code block, where we declare variables a and b to hold integer values, the variable my_string to hold a string value, and the variable my_double_array to hold an array of floating-point values:
int a;
int b;
String my_string;
double[] my_double_array;
At this point, the variables do not yet hold any non-null values. They have just been initialized in preparation for an assignment statement, which will assign to each a value of the appropriate type. Thus, if you try copying the preceding code block into the Painless Lab code editor, you will see an output of null in the Output window as shown in Figure 9.17.
Figure 9.17 – On the left, Painless variables of various types are initialized. On the right, the Output panel shows null, because these variables have not yet been assigned a value
Important note The Painless Lab code editor only displays the result of the last statement.
Next, let's assign some values to these variables so that we can do some interesting things with them. The assignments are shown in the following code block. In the first two lines, we assign integer values to our integer variables a and b. In the third line, we assign a string "hello world" to the string variable my_string, and in the final line, we initialize a new array with floating-point values:
a = 1;
b = 5;
my_string = "hello world";
my_double_array = new double[] {1.0, 2.0, 2.5};
Let's do some interesting things with these variables to illustrate what operators are available in Painless. We will only be able to cover a few of the available operators. For the full list of available operators, please see the Painless language specification (https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-operators.html). The following code blocks illustrate basic mathematical operations: addition, subtraction, division, and multiplication as well as the modulus operation or taking the remainder:
int a;
int b;
a = 1;
b = 5;
// Addition
int addition;
addition = a+b;
// Subtraction
int subtraction;
subtraction = a-b;
// Multiplication
int multiplication;
multiplication = a*b;
// Integer Division
int int_division;
int_division = a/b;
// Remainder
int remainder;
remainder = a%b;
Try out these code examples on your own in the Painless Lab and you should be able to see the results of your evaluation, as illustrated in the case of addition in Figure 9.18.
Figure 9.18 – Using the Painless Lab code editor and console for addition in Painless. The result stored in a variable called "addition" in the code editor on the left is displayed in the Output tab on the right
In addition to mathematical operations, we will also take a look at Boolean operators. These are vital for many Painless scripts and configurations as well as for control flow statements, which we will take a look at afterward. The code snippets that follow illustrate how to declare a variable to hold a Boolean (true/false) value and how to use comparison operators to determine whether values are less than, greater than, less than or equal to, or greater than or equal to one another. For a full list of Boolean operators in Painless, please consult the Painless specification available here: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-operators-boolean.html:
boolean less_than = 4 < 5;
boolean less_than_or_equal = 4 <= 5;
boolean greater_than = 4 > 5;
boolean greater_than_or_equal = 4 >= 5;
As an exercise, copy the preceding code block into the Painless Lab code editor. If you wish, you can view the contents of each of these variables by typing its name followed by a semicolon into the last line of the Painless Lab code editor and the value stored in the variable will be printed in the Output window on the right, as shown in Figure 9.19.
Figure 9.19 – Typing in the variable name followed by a semicolon in the Painless Lab code editor will output the contents of the variable in the Output tab
While the Boolean operators illustrated here are useful in many numerical computations, we probably could not write effective control flow statements without the equality operators == and !=, which check whether or not two variables are equal. The following code block illustrates how to use these operators with a few practical examples:
// boolean operator for testing for equality
boolean two_equal_strings = "hello" == "hello";
two_equal_strings;
// boolean operator for testing for inequality
boolean not_equal = 5 != 6;
not_equal;
Last but not least in our tour of the Boolean operators in Painless, we will look at a code block that showcases how to use the instanceof operator, which checks whether a given variable is an instance of a type and returns true or false. This is a useful operator to have when you are writing Painless code that you only want to operate on variables of a specified type:
// boolean operator instanceof tests if a variable is an instance of a type
// the variable is_integer evaluates to true
int int_number = 5;
boolean is_integer = int_number instanceof int;
is_integer;
In the final part of this section, let's take a look at one of the most important building blocks in our Painless script: if-else statements and for loops. The syntax for if-else statements is shown in the following code block with the help of an example:
int a = 5;
int sum;
if (a < 6) {
    sum = a+5;
}
else {
    sum = a-5;
}
sum;
In the preceding code block, we declare an integer variable, a, and assign it to contain the integer value 5. We then declare another integer variable, sum. This variable will change according to the execution branch that is taken in the if-else statement. Finally, we see that the if-else statement first checks whether the integer variable a is less than 6, and if it is, stores the result of adding a and the integer 5 in the variable sum. If not, the amount stored in the variable sum is the result of subtracting 5 from a. If you type this code in the Painless Lab code editor, the Output console will print out the value of sum as 10 (as shown in Figure 9.20), which is what we expect based on the previous analysis.
Figure 9.20 – The if-else statement results in the sum variable being set to the value 10
Finally, we will take a look at how to write a for loop, which is useful for various data analysis and data processing tasks with Painless. In our for loop, we will be iterating over a string variable and calculating how many occurrences of the letter a occur in the string. This is a very simple example, but will hopefully help you to understand the syntax so that you can apply it in your own examples:
// initialize the string and the counter variable
String sample_string = "a beautiful day";
int counter = 0;
// loop over each character in the string and count occurrences of the letter a
for (int i = 0; i < sample_string.length(); i++) {
    if (sample_string.charAt(i) == (char)"a") {
        counter += 1;
    }
}
counter;