Automating Data Quality Monitoring at Scale: Scaling Beyond Rules with Machine Learning (Final)
9781098145934
The world's businesses ingest a combined 2.5 quintillion bytes of data every day. But how much of this vast amount
125
102
9MB
English
Pages 170
Year 2024
Report DMCA / Copyright
DOWNLOAD EPUB FILE
Table of contents :
Foreword
Preface
Who Should Use This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. The Data Quality Imperative
High-Quality Data Is the New Gold
Data-Driven Companies Are Today’s Disrupters
Data Analytics Is Democratized
AI and Machine Learning Are Differentiators
Generative AI and data quality
Companies Are Investing in a Modern Data Stack
More Data, More Problems
Issues Inside the Data Factory
Data Migrations
Third-Party Data Sources
Company Growth and Change
Exogenous Factors
Why We Need Data Quality Monitoring
Data Scars
Data Shocks
Automating Data Quality Monitoring: The New Frontier
2. Data Quality Monitoring Strategies and the Role of Automation
Monitoring Requirements
Data Observability: Necessary, but Not Sufficient
Traditional Approaches to Data Quality
Manual Data Quality Detection
Rule-Based Testing
Metrics Monitoring
Automating Data Quality Monitoring with Unsupervised Machine Learning
What Is Unsupervised Machine Learning?
An Analogy: Lane Departure Warnings
The Limits of Automation
Automating rule and metric creation
Rules
Metrics
A Four-Pillar Approach to Data Quality Monitoring
3. Assessing the Business Impact of Automated Data Quality Monitoring
Assessing Your Data
Volume
Variety
Unstructured data
Semistructured data
Structured data
Normalized relational data
Fact tables
Summary tables
Velocity
Veracity
Special Cases
Assessing Your Industry
Regulatory Pressure
AI/ML Risks
Feature shocks
NULL increases
Change in correlation
Duplicate data
Data as a Product
Assessing Your Data Maturity
Assessing Benefits to Stakeholders
Engineers
Data Leadership
Scientists
Consumers
Conducting an ROI Analysis
Quantitative Measures
Qualitative Measures
Conclusion
4. Automating Data Quality Monitoring with Machine Learning
Requirements
Sensitivity
Specificity
Transparency
Scalability
Nonrequirements
Data Quality Monitoring Is Not Outlier Detection
ML Approach and Algorithm
Data Sampling
Sample size
Bias and efficiency
Feature Encoding
Model Development
Training and evaluation
Computational efficiency
Model Explainability
Putting It Together with Pseudocode
Other Applications
Conclusion
5. Building a Model That Works on Real-World Data
Data Challenges and Mitigations
Seasonality
Time-Based Features
Chaotic Tables
Updated-in-Place Tables
Column Correlations
Model Testing
Injecting Synthetic Issues
Example
Benchmarking
Analyzing performance
Putting it together with pseudocode
Improving the Model
Conclusion
6. Implementing Notifications While Avoiding Alert Fatigue
How Notifications Facilitate Data Issue Response
Triage
Routing
Resolution
Documentation
Taking Action Without Notifications
Anatomy of a Notification
Visualization
Actions
Text Description
Who Created/Last Edited the Check
Delivering Notifications
Notification Audience
Notification Channels
Email
Real-time communication
PagerDuty or Opsgenie-type platforms (alerting, on-call management)
Ticketing platforms (Jira, ServiceNow)
Webhooks
Notification Timing
Avoiding Alert Fatigue
Scheduling Checks in the Right Order
Clustering Alerts Using Machine Learning
Suppressing Notifications
Priority level
Continuous retraining
Narrowing the scope of the model
Making the check less sensitive
What not to suppress: Expected changes
Automating the Root Cause Analysis
Conclusion
7. Integrating Monitoring with Data Tools and Systems
Monitoring Your Data Stack
Data Warehouses
Integrating with Data Warehouses
Security
Reconciling Data Across Multiple Warehouses
Comparing datasets with rule-based testing
Comparing datasets with unsupervised machine learning
Comparing summary statistics
Data Orchestrators
Integrating with Orchestrators
Data Catalogs
Integrating with Catalogs
Data Consumers
Analytics and BI Tools
MLOps
Conclusion
8. Operating Your Solution at Scale
Build Versus Buy
Vendor Deployment Models
SaaS
Fully in-VPC or on-prem
Hybrid
Configuration
Determining Which Tables Are Most Important
Deciding What Data in a Table to Monitor
Configuration at Scale
Enablement
User Roles and Permissions
Onboarding, Training, and Support
Improving Data Quality Over Time
Initiatives
Metrics
Triage and resolution
Executive dashboards
Scorecards
From Chaos to Clarity
A. Types of Data Quality Issues
Table Issues
Late Arrival
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Schema Changes
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Untraceable Changes
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Row Issues
Incomplete Rows
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Duplicate Rows
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Temporal Inconsistency
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Value Issues
Missing Values
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Incorrect Values
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Invalid Values
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Multi Issues
Relational Failures
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Inconsistent Sources
Definition
Example
Causes
Analytics impact
ML impact
How to monitor
Index