Training Data for Machine Learning (ISBN 9781492094524)

Your training data has as much to do with the success of your data project as the algorithms themselves, because most failures in AI systems trace back to the training data rather than the algorithms.

English, 329 pages, 2023

Table of contents:
Preface
Who Should Read This Book?
For the Technical Professional and Engineer
For the Manager and Director
For the Subject Matter Expert and Data Annotation Specialist
For the Data Scientist
Why I Wrote This Book
How This Book Is Organized
Themes
The Basics and Getting Started
Concepts and Theories
Putting It All Together
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What Is Training Data Most Concerned With?
Schema
Raw data
Annotations
Quality
Integrations
The human role
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Process Improvement Opportunities
Why Training Data Matters
ML Applications Are Becoming Mainstream
The Foundation of Successful AI
Training Data Is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing for Data Science
ML Program Ecosystem
Raw data media types
Data-Centric Machine Learning
Failures
History of Development Affects Training Data Too
What Training Data Is Not
Generative AI
Human Alignment Is Human Supervision
Summary
2. Getting Up and Running
Introduction
Getting Up and Running
Installation
Tasks Setup
Annotator Setup
Portal (default)
Embedded
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Optimization
Tools Overview
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded Supervision
Human-Computer Supervision
Separation of End Concerns
Standards
Many Personas
A Paradigm to Deliver Machine Learning Software
Trade-Offs
Costs
Installed Versus Software as a Service
Development System
Sequentially dependent discoveries
Scale
Why is it useful to define scale?
Transitioning from small to medium scale
Large-scale thoughts
Installation Options
Packaging
Storage
Database
Data configuration
Annotation Interfaces
Modeling Integration
Multi-User Versus Single-User Systems
Integrations
Scope
Platform and suite solutions
Decision-making process
Cautions
Point solutions
Tools in between
Hidden Assumptions
Security
Security architecture
Attack surface
Security configuration
Security benefits
User access
Data science access
Root-level access
Open Source and Closed Source
Choose an open source tool to get up and running quickly
See the forest for the trees
Capability over optimizations
Ease of use in different flows
Vastly different assumptions
Look at settings, not first impressions
Is it easy to use, or just lacking features?
Customization is the name of the game
History
Open Source Standards
Realizing the Need for Dedicated Tooling
More usage, more demands
Advent of new standards
Summary
3. Schema
Schema Deep Dive Introduction
Labels and Attributes—What Is It?
What Do We Care About?
Introduction to Labels
Attributes Introduction
Attribute concepts
Schema complexity trade-off
Attribute depth
Attribute Complexity Exceeds Spatial Complexity
The hidden background case
Example of sharing attributes between labels
Technical Overview
Example of an attribute in relation to an instance
Data representations for engineering
Examples of attributes
Technical example of an attribute
Spatial Representation—Where Is It?
Using Spatial Types to Prevent Social Bias
One way to avoid spatial bias
Joint responsibility
Trade-Offs with Types
Computer Vision Spatial Type Examples
Full image tag
Box (2D)
Polygon
Ellipse and circle
Cuboid
Types with multiple uses
Other types
Raster mask
Polygons and raster masks
Keypoint geometry
Custom spatial templates
Complex spatial types
Relationships, Sequences, Time Series: When Is It?
Sequences and Relationships
When
Guides and Instructions
Judgment Calls
Relation of Machine Learning Tasks to Training Data
Semantic Segmentation
Image Classification (Tags)
Object Detection
Pose Estimation
Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts
Summary
4. Data Engineering
Introduction
Who Wants the Data?
Annotators
Data scientists
ML programs
Application engineers
Other stakeholders
A Game of Telephone
When a system of record is needed
Planning a Great System
Naive and Training Data–Centric Approaches
Naive approaches
Training data–centric (system of record)
The first steps
Raw Data Storage
By Reference or by Value
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
Data Storage: Where Does the Data Rest?
External Reference Connection
Raw Media (BLOB)–Type Specific
Images
Video
3D
Text
Medical
Geospatial
Formatting and Mapping
User-Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Versioning
Per-instance history
Per file and per set
Per-export snapshots
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File-Based Exports
Streaming Data
Streaming benefits
Streaming drawbacks
Example: Fetch and stream
Queries Introduction
Integrations with the Ecosystem
Security
Access Control
Identity and Authorization
Example of Setting Permissions
Signed URLs
Cloud connections and signed URLs
Personally Identifiable Information
PII-compliant data chain
PII avoidance
PII removal
Pre-Labeling
Updating Data
Pre-labeling gotchas
Pre-labeling data prep process
Summary
5. Workflow
Introduction
Glue Between Tech and People
Why Are Human Tasks Needed?
Partnering with Non-Software Users in New Ways
Getting Started with Human Tasks
Basics
Schemas’ Staying Power
User Roles
Training
Gold Standard Training
Task Assignment Concepts
Do You Need to Customize the Interface?
How Long Will the Average Annotator Be Using It?
Tasks and Project Structure
Quality Assurance
Annotator Trust
Annotators Are Partners
Who supervises the data
All training data has errors
Annotator needs
Common Causes of Training Data Errors
Task Review Loops
Standard review loop
Consensus
Analytics
Annotation Metrics Examples
Data Exploration
Data exploration tool example
Explore processes
Explore examples
Similar image reduction
Models
Using the Model to Debug the Humans
Distinctions Between a Dataset, Model, and Model Run
Getting Data to Models
Dataflow
Overview of Streaming
Data Organization
Folders and static organization
Filters and dynamic organization
Pipelines and Processes
The dataset connection
Sending a single file to that set
Relating a dataset to a template
Putting the whole example together
Expanding the example
Non-linear example
Hooks
Direct Annotation
Business Process Integration
Attributes
Depth of Labeling
Supervising Existing Data
Interactive Automations
Example: Semantic Segmentation Auto Bordering
Video
Motion
Examples of tracking objects through time (time series)
Static objects
Persistent objects: football example
Series example
Video events
Detecting sequence errors
Common issues in video annotation
Summary
6. Theories, Concepts, and Maintenance
Introduction
Theories
A System Is Only as Useful as Its Schema
Who Supervises the Data Matters
Intentionally Chosen Data Is Best
Working with Historical Data
Training Data Is Like Code
Surface Assumptions Around Usage of Your Training Data
Use definitions and processes to protect against assumptions
Human Supervision Is Different from Classic Datasets
Discovery versus automation
Discovery
General Concepts
Data Relevancy
Overall system design
Raw data collection
Need for Both Qualitative and Quantitative Evaluations
Iterations
Prioritization: What to Label
Transfer Learning’s Relation to Datasets (Fine-Tuning)
Per-Sample Judgment Calls
Ethical and Privacy Considerations
Bias
Bias Is Hard to Escape
Metadata
Preventing Lost Metadata
Train/Val/Test Is the Cherry on Top
Sample Creation
Simple Schema for a Strawberry Picking System
Geometric Representations
Binary Classification
Let’s Manually Create Our First Set
Upgraded Classification
Where Is the Traffic Light?
Maintenance
Actions
Increase schema depth to improve performance
Better align the spatial type to the raw data
Create more tasks
Change the raw data
Net Lift
Levels of System Maturity of Training Data Operations
Applied Versus Research Sets
Training Data Management
Quality
Completed Tasks
Freshness
Maintaining Set Metadata
Task Management
Summary
7. AI Transformation and Use Cases
Introduction
AI Transformation
Seeing Your Day-to-Day Work as Annotation
The Creative Revolution of Data-centric AI
You Can Create New Data
You Can Change What Data You Collect
You Can Change the Meaning of the Data
You Can Create!
Think Step Function Improvement for Major Projects
Build Your AI Data to Secure Your AI Present and Future
Appoint a Leader: The Director of AI Data
New Expectations People Have for the Future of AI
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Producer and consumer comparison
Producer and consumer mindset
Why is new structure needed?
The budget
The AI Director’s background
Director of Training Data role
AI-focused company modifications
Classic company modification
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Director of AI data responsibilities
Training Data Evangelist
Training Data Production Manager(s)
Annotation Producer
Data Engineer
Use Case Discovery
Rubric for Good Use Cases
Detailed rubric
Adds a new capability use case
Repeating use cases
Specialists and experts
Evaluating a Use Case Against the Rubric
Automatic background removal
Evaluation example
Conceptual Effects of Use Cases
Ongoing impact of use cases
The New “Crowd Sourcing”: Your Own Experts
Key Levers on Training Data ROI
What the Annotated Data Represents
Trade-Offs of Controlling Your Own Training Data
The Need for Hardware
Common Project Mistakes
Modern Training Data Tools
Think Learning Curve, Not Perfection
New Training and Knowledge Are Required
Everyone
Annotators
Managers
Executives
How Companies Produce and Consume Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles
Summary
8. Automation
Introduction
Getting Started
Motivation: When to Use These Methods?
Check What Part of the Schema a Method Is Designed to Work On
What Do People Actually Use?
Commonly used techniques
Domain-specific
A note on ordering
What Kind of Results Can I Expect?
Common Confusions
“Fully” automatic labeling for novel model creation
Proprietary automatic methods
User Interface Optimizations
Risks
Trade-Offs
Nature of Automations
Setup Costs
How to Benchmark Well
How to Scope the Automation Relative to the Problem
Correction Time
Subject Matter Experts
Consider How the Automations Stack
Pre-Labeling
Standard Pre-Labeling
Benefits
Caveats
Pre-Labeling a Portion of the Data Only
Use off-the-shelf models
Clear separation of concerns
The “one step early” trick
How to get started pre-labeling
Interactive Annotation Automation
Creating Your Own
Technical Setup Notes
What Is a Watcher? (Observer Pattern)
How to Use a Watcher
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using GrabCut
Full Image Model Prediction Example
Example: Person Detection for Different Attribute
Quality Assurance Automation
Using the Model to Debug the Humans
Automated Checklist Example
Domain-Specific Reasonableness Checks
Data Discovery: What to Label
Human Exploration
Raw Data Exploration
Metadata Exploration
Adding Pre-Labeling-Based Metadata
Augmentation
Better Models Are Better than Better Augmentation
To Augment or Not to Augment
Training/runtime augmentation
Patch and inject method (crop and inject)
Simulation and Synthetic Data
Simulations Still Need Human Review
Media Specific
What Methods Work with Which Media?
Considerations
Media-Specific Research
Domain Specific
Geometry-Based Labeling
Multi-sensor labeling automation—spatial
Spatial labeling
Heuristics-Based Labeling
Summary
9. Case Studies and Stories
Introduction
Industry
A Security Startup Adopts Training Data Tools
Quality Assurance at a Large-Scale Self-Driving Project
Tricky schemas should be expanded, not shrunk
Don’t justify a clearly bad schema with domain-specific assumptions
Tracking spatial quality and errors per image
Regression and focused effort do not always solve specific problems
Overfocus on complex instructions instead of fixing the schema
Trade-offs of attempting to achieve “perfection” in nuanced domain-specific cases
Understanding nuanced cases
Learning from mistakes
Define occlusion well
Expand schemas
Remember the null case
Missing assumptions for language barriers
Don’t overfocus on spatial information
Big-Tech Challenges
Two annotation software teams
Confusing the media types
Non-queryable
Different teams for annotations and raw media
Moving toward a system of record
Missing the big picture
Solution
Let’s address loops
Human in the loop
The case for aligning teams around training data
Insurance Tech Startup Lessons
Will the production data match the training data?
Too late to bring in training data software
Stories
“Static Schema Prevented Innovation at Self Driving Firm”
“Startup Didn’t Change Schema and Wasted Effort”
“Accident Prevention Startup Missed Data-Centric Approach”
“Sports Startup Successfully Used Pre-Labeling”
An Academic Approach to Training Data
Kaggle TSA Competition
Keying in on training data
How focusing on training data reveals commercial efficiencies
Learning lessons and mistakes
Summary
Index
