Deep Learning and Scientific Computing with R torch

torch is an R port of PyTorch, one of the two most-employed deep learning frameworks in industry and research. It is also an excellent tool to use in scientific computations. It is written entirely in R and C/C++. Though still "young" as a project, R torch already has a vibrant community of users and developers.

Experience shows that torch users come from a broad range of different backgrounds. This book aims to be useful to (almost) everyone. Globally speaking, its purposes are threefold:

- Provide a thorough introduction to torch basics, both by carefully explaining underlying concepts and ideas, and by showing enough examples for the reader to become "fluent" in torch.
- With the same focus on conceptual explanation, show how to use torch in deep-learning applications, ranging from image recognition and time series prediction to audio classification.
- Provide a concepts-first, reader-friendly introduction to selected scientific-computation topics (namely, matrix computations, the Discrete Fourier Transform, and wavelets), all accompanied by torch code you can play with.

Deep Learning and Scientific Computing with R torch is written with first-hand technical expertise and in an engaging, fun-to-read way.
Chapman & Hall/CRC: The R Series

Series Editors:
John M. Chambers, Department of Statistics, Stanford University, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, USA
Hadley Wickham, RStudio, Boston, Massachusetts, USA

Recently Published Titles:

R for Conservation and Development Projects: A Primer for Practitioners
Nathan Whitmore

Using R for Bayesian Spatial and Spatio-Temporal Health Modeling
Andrew B. Lawson

Engineering Production-Grade Shiny Apps
Colin Fay, Sébastien Rochette, Vincent Guyader, and Cervan Girard

Javascript for R
John Coene

Advanced R Solutions
Malte Grosser, Henning Bumann, and Hadley Wickham

Event History Analysis with R, Second Edition
Göran Broström

Behavior Analysis with Machine Learning Using R
Enrique Garcia Ceja

Rasch Measurement Theory Analysis in R: Illustrations and Practical Guidance for Researchers and Practitioners
Stefanie Wind and Cheng Hua

Spatial Sampling with R
Dick R. Brus

Crime by the Numbers: A Criminologist's Guide to R
Jacob Kaplan

Analyzing US Census Data: Methods, Maps, and Models in R
Kyle Walker

ANOVA and Mixed Models: A Short Introduction Using R
Lukas Meier

Tidy Finance with R
Stefan Voigt, Patrick Weiss, and Christoph Scheuch

Deep Learning and Scientific Computing with R torch
Sigrid Keydana

Model-Based Clustering, Classification, and Density Estimation Using mclust in R
Luca Scrucca, Chris Fraley, T. Brendan Murphy, and Adrian E. Raftery

Spatial Data Science: With Applications in R
Edzer Pebesma and Roger Bivand

For more information about this series, please visit: https://www.crcpress.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER
Deep Learning and Scientific Computing with R torch
Sigrid Keydana
Designed cover image: https://www.shutterstock.com/image-photo/eurasian-red-squirrel-sciurus-vulgaris-looking-2070311126

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 Sigrid Keydana

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

ISBN: 978-1-032-23138-9 (hbk)
ISBN: 978-1-032-23139-6 (pbk)
ISBN: 978-1-003-27592-3 (ebk)

DOI: 10.1201/9781003275923

Typeset in Latin Modern font by KnowledgeWorks Global Ltd.
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents

List of Figures
Preface
Author Biography

I  Getting Familiar with Torch

1  Overview

2  On torch, and How to Get It
   2.1  In torch World
   2.2  Installing and Running torch

3  Tensors
   3.1  What's in a Tensor?
   3.2  Creating Tensors
        3.2.1  Tensors from values
        3.2.2  Tensors from specifications
        3.2.3  Tensors from datasets
   3.3  Operations on Tensors
        3.3.1  Summary operations
   3.4  Accessing Parts of a Tensor
        3.4.1  "Think R"
   3.5  Reshaping Tensors
        3.5.1  Zero-copy reshaping vs. reshaping with copy
   3.6  Broadcasting
        3.6.1  Broadcasting rules

4  Autograd
   4.1  Why Compute Derivatives?
   4.2  Automatic Differentiation Example
   4.3  Automatic Differentiation with torch autograd

5  Function Minimization with autograd
   5.1  An Optimization Classic
   5.2  Minimization from Scratch

6  A Neural Network from Scratch
   6.1  Idea
   6.2  Layers
   6.3  Activation Functions
   6.4  Loss Functions
   6.5  Implementation
        6.5.1  Generate random data
        6.5.2  Build the network
        6.5.3  Train the network

7  Modules
   7.1  Built-in nn_module()s
   7.2  Building up a Model
        7.2.1  Models as sequences of layers: nn_sequential()
        7.2.2  Models with custom logic

8  Optimizers
   8.1  Why Optimizers?
   8.2  Using Built-in torch Optimizers
   8.3  Parameter Update Strategies
        8.3.1  Gradient descent (a.k.a. steepest descent, a.k.a. stochastic gradient descent (SGD))
        8.3.2  Things that matter
        8.3.3  Staying on track: Gradient descent with momentum
        8.3.4  Adagrad
        8.3.5  RMSProp
        8.3.6  Adam

9  Loss Functions
   9.1  torch Loss Functions
   9.2  What Loss Function Should I Choose?
        9.2.1  Maximum likelihood
        9.2.2  Regression
        9.2.3  Classification

10  Function Minimization with L-BFGS
    10.1  Meet L-BFGS
          10.1.1  Changing slopes
          10.1.2  Exact Newton method
          10.1.3  Approximate Newton: BFGS and L-BFGS
          10.1.4  Line search
    10.2  Minimizing the Rosenbrock Function with optim_lbfgs()
          10.2.1  optim_lbfgs() default behavior
          10.2.2  optim_lbfgs() with line search

11  Modularizing the Neural Network
    11.1  Data
    11.2  Network
    11.3  Training
    11.4  What's to Come

II  Deep Learning with torch

12  Overview

13  Loading Data
    13.1  Data vs. dataset() vs. dataloader() – What's the Difference?
    13.2  Using dataset()s
          13.2.1  A self-built dataset()
          13.2.2  tensor_dataset()
          13.2.3  torchvision::mnist_dataset()
    13.3  Using dataloader()s

14  Training with luz
    14.1  Que haya luz – Que haja luz – Let there be Light
    14.2  Porting the Toy Example
          14.2.1  Data
          14.2.2  Model
          14.2.3  Training
    14.3  A More Realistic Scenario
          14.3.1  Integrating training, validation, and test
          14.3.2  Using callbacks to "hook" into the training process
          14.3.3  How luz helps with devices
    14.4  Appendix: A Train-Validate-Test Workflow Implemented by Hand

15  A First Go at Image Classification
    15.1  What Does It Take to Classify an Image?
    15.2  Neural Networks for Feature Detection and Feature Emergence
          15.2.1  Detecting low-level features with cross-correlation
          15.2.2  Build up feature hierarchies
    15.3  Classification on Tiny Imagenet
          15.3.1  Data pre-processing
          15.3.2  Image classification from scratch

16  Making Models Generalize
    16.1  The Royal Road: More – and More Representative! – Data
    16.2  Pre-processing Stage: Data Augmentation
          16.2.1  Classic data augmentation
          16.2.2  Mixup
    16.3  Modeling Stage: Dropout and Regularization
          16.3.1  Dropout
          16.3.2  Regularization
    16.4  Training Stage: Early Stopping

17  Speeding up Training
    17.1  Batch Normalization
    17.2  Dynamic Learning Rates
          17.2.1  Learning rate finder
          17.2.2  Learning rate schedulers
    17.3  Transfer Learning

18  Image Classification, Take Two: Improving Performance
    18.1  Data Input (Common for All)
    18.2  Run 1: Dropout
    18.3  Run 2: Batch Normalization
    18.4  Run 3: Transfer Learning

19  Image Segmentation
    19.1  Segmentation vs. Classification
    19.2  U-Net, a "Classic" in Image Segmentation
    19.3  U-Net: A torch Implementation
          19.3.1  Encoder
          19.3.2  Decoder
          19.3.3  The "U"
          19.3.4  Top-level module
    19.4  Dogs and Cats

20  Tabular Data
    20.1  Types of Numerical Data, by Example
    20.2  A torch dataset for Tabular Data
    20.3  Embeddings in Deep Learning: The Idea
    20.4  Embeddings in Deep Learning: Implementation
    20.5  Model and Model Training
    20.6  Embedding-generated Representations by Example

21  Time Series
    21.1  Deep Learning for Sequences: The Idea
    21.2  A Basic Recurrent Neural Network
          21.2.1  Basic rnn_cell()
          21.2.2  Basic rnn_module()
    21.3  Recurrent Neural Networks in torch
    21.4  RNNs in Practice: GRU and LSTM
    21.5  Forecasting Electricity Demand
          21.5.1  Data inspection
          21.5.2  Forecasting the very next value
          21.5.3  Forecasting multiple time steps ahead

22  Audio Classification
    22.1  Classifying Speech Data
    22.2  Two Equivalent Representations
    22.3  Combining Representations: The Spectrogram
    22.4  Training a Model for Audio Classification
          22.4.1  Baseline setup: Training a convnet on spectrograms
          22.4.2  Variation one: Use a Mel-scale spectrogram instead
          22.4.3  Variation two: Complex-valued spectrograms

III  Other Things to Do with torch: Matrices, Fourier Transform, and Wavelets

23  Overview

24  Matrix Computations: Least-squares Problems
    24.1  Five Ways to Do Least Squares
    24.2  Regression for Weather Prediction
          24.2.1  Least squares (I): Setting expectations with lm()
          24.2.2  Least squares (II): Using linalg_lstsq()
          24.2.3  Interlude: What if we hadn't standardized the data?
          24.2.4  Least squares (III): The normal equations
          24.2.5  Least squares (IV): Cholesky decomposition
          24.2.6  Least squares (V): LU factorization
          24.2.7  Least squares (VI): QR factorization
          24.2.8  Least squares (VII): Singular Value Decomposition (SVD)
          24.2.9  Checking execution times
    24.3  A Quick Look at Stability

25  Matrix Computations: Convolution
    25.1  Why Convolution?
    25.2  Convolution in One Dimension
          25.2.1  Two ways to think about convolution
          25.2.2  Implementation
    25.3  Convolution in Two Dimensions
          25.3.1  How it works (output view)
          25.3.2  Implementation

26  Exploring the Discrete Fourier Transform (DFT)
    26.1  Understanding the Output of torch_fft_fft()
          26.1.1  Starting point: A cosine of frequency 1
          26.1.2  Reconstructing the magic
          26.1.3  Varying frequency
          26.1.4  Varying amplitude
          26.1.5  Adding phase
          26.1.6  Superposition of sinusoids
    26.2  Coding the DFT
    26.3  Fun with sox

27  The Fast Fourier Transform (FFT)
    27.1  Some Terminology
    27.2  Radix-2 Decimation-in-Time (DIT) Walkthrough
          27.2.1  The main idea: Recursive split
          27.2.2  One further simplification
    27.3  FFT as Matrix Factorization
    27.4  Implementing the FFT
          27.4.1  DFT, the "loopy" way
          27.4.2  DFT, vectorized
          27.4.3  Radix-2 decimation in time FFT, recursive
          27.4.4  Radix-2 decimation in time FFT by matrix factorization
          27.4.5  Radix-2 decimation in time FFT, optimized for vectorization
          27.4.6  Checking against torch_fft_fft()
          27.4.7  Comparing performance
          27.4.8  Making use of Just-in-Time (JIT) compilation

28  Wavelets
    28.1  Introducing the Morlet Wavelet
    28.2  The Roles of K and ω_a
    28.3  Wavelet Transform: A Straightforward Implementation
    28.4  Resolution in Time versus in Frequency
    28.5  How Is This Different from a Spectrogram?
    28.6  Performing the Wavelet Transform in the Fourier Domain
    28.7  Creating the Wavelet Diagram
    28.8  A Real-world Example: Chaffinch's Song

References

Index
List of Figures

3.1    A 4x3x2 tensor.
4.1    Hypothetical loss function (a paraboloid).
4.2    Example of a computational graph.
5.1    Rosenbrock function.
8.1    Steepest descent on an isotropic paraboloid, using different learning rates.
8.2    Steepest descent on a non-isotropic paraboloid, using (minimally!) different learning rates.
8.3    SGD with momentum (white), compared with vanilla SGD (gray).
8.4    Adagrad (white), compared with vanilla SGD (gray).
8.5    RMSProp (white), compared with vanilla SGD (gray).
8.6    Adam (white), compared with vanilla SGD (gray).
15.1   White square on black background.
15.2   Gimp convolution matrix that detects the left edge.
15.3   Gimp convolution matrix that detects the right edge.
15.4   Gimp convolution matrix that detects the top edge.
15.5   Gimp convolution matrix that detects the bottom edge.
15.6   Input image, filter, and result as pixel values. Negative pixel values being impossible, −255 will end up as 0.
15.7   Convolution, and the effect of padding. Copyright Dumoulin and Visin (2016), reproduced under MIT license.
15.8   Convolution, and the effect of strides. Copyright Dumoulin and Visin (2016), reproduced under MIT license.
15.9   Convolution, and the effect of dilation. Copyright Dumoulin and Visin (2016), reproduced under MIT license.
15.10  Feature visualization on a subset of layers of GoogleNet. Figure from Olah, Mordvintsev, and Schubert (2017), reproduced without modification.
16.1   MNIST: The first thirty-two images in the test set.
16.2   MNIST, with random rotations, translations, and flips.
16.3   Mixing up MNIST, with mixing weights of 0.9.
16.4   Mixing up MNIST, with mixing weights of 0.7.
16.5   Mixing up MNIST, with mixing weights of 0.5.
17.1   Output of luz's learning rate finder, run on MNIST.
18.1   Learning rate finder, run on Tiny Imagenet. Convnet with dropout layers.
18.2   Learning rate finder, run on Tiny Imagenet. Convnet with batchnorm layers.
18.3   Learning rate finder, run on Tiny Imagenet. Convnet with transfer learning (ResNet).
19.1   U-Net architecture from Ronneberger, Fischer, and Brox (2015), reproduced with the principal author's permission.
19.2   Transposed convolution. Copyright Dumoulin and Visin (2016), reproduced under MIT license.
19.3   Learning rate finder, run on the Oxford Pet Dataset.
19.4   Cats and dogs: Sample images and predicted segmentation masks.
20.1   Losses and accuracies (training and validation, resp.) for binary heart disease classification.
20.2   heart_disease$slope: PCA of embedding weights, biplot visualizing factor loadings.
20.3   heart_disease$slope: PCA of embedding weights, locating the original input values in two-dimensional space.
21.1   One year of electricity demand, decomposed into trend, seasonal components, and remainder.
21.2   A single month of electricity demand, decomposed into trend, seasonal components, and remainder.
21.3   Learning rate finder output for the one-step-ahead-forecast model.
21.4   Fitting a one-step forecast model on vic_elec.
21.5   One-step-ahead forecast on the last month of the test set.
21.6   Learning rate finder output for multiple-step prediction.
21.7   Fitting a multiple-step forecast model on vic_elec.
21.8   A sample of week-long forecasts on the last month of the test set.
22.1   The spoken word "bird", in time-domain representation.
22.2   The spoken word "bird", in frequency-domain representation.
22.3   The spoken word "bird": Spectrogram.
22.4   Learning rate finder, run on the baseline model.
22.5   Fitting the baseline model.
22.6   Alluvial plot, illustrating which categories were confused most often.
22.7   Mel filter bank with sixteen filters, as applied to 257 Fourier coefficients.
22.8   Fourier and Mel coefficients, compared on one window of the "bird" spectrogram.
22.9   Learning rate finder, run on the Mel-transform-enriched model.
22.10  Fitting the Mel-transform-enriched model.
22.11  Alluvial plot for the Mel-transform-enriched setup.
22.12  Learning rate finder, run on the complex-spectrogram model.
22.13  Fitting the complex-spectrogram model.
22.14  Alluvial plot for the complex-spectrogram setup.
24.1   Timing least-squares algorithms, by example.
26.1   Pure cosine that accomplishes one revolution over the complete sample period (64 samples).
26.2   Real parts, imaginary parts, magnitudes, and phases of the Fourier coefficients, obtained on a pure cosine that performs a single revolution over the sampling period. Imaginary parts as well as phases are all zero.
26.3   A pure cosine that performs four revolutions over the sampling period, and its DFT. Imaginary parts and phases are still zero.
26.4   A pure cosine that performs thirty-two revolutions over the sampling period, and its DFT. This is the highest frequency where, given sixty-four sample points, no aliasing will occur. Imaginary parts and phases still zero.
26.5   Pure cosine with four revolutions over the sampling period, and doubled amplitude. Imaginary parts and phases still zero.
26.6   Delaying a pure cosine wave by π/2 yields a pure sine wave. Now the real parts of all coefficients are zero; instead, non-zero imaginary values are appearing. The phase shift at those positions is π/2.
26.7   Superposition of pure sinusoids, and its DFT.
26.8   Spectrogram, created by sox resources/dial.wav -n spectrogram -m -l -w kaiser -o dial-spectrogram.png.
26.9   Three consecutive ringtones, time-domain representation.
26.10  DFT of the ringtone signal, computed by means of torch_fft_fft(). Displayed are the magnitudes of frequencies below the Nyquist rate.
26.11  DFT of the ringtone signal, using our hand-written code. Displayed are the magnitudes of frequencies below the Nyquist rate.
26.12  Reconstruction of the time-domain signal from the output of dft().
27.1   Benchmarking various FFT and DFT implementations (see text). Also includes torch_fft_fft() for reference.
27.2   Exploring the effect of Just-in-Time compilation (JIT) on the performance of fft_vec(), fft_matrix(), and torch_fft_fft().
28.1   A Morlet wavelet.
28.2   Morlet wavelet: Effects of varying scale and analysis frequency.
28.3   An example signal, consisting of a low-frequency and a high-frequency half.
28.4   Wavelet Transform of the above two-part signal. Analysis frequency is 100 Hertz.
28.5   Wavelet Transform of the above two-part signal, with K set to twenty instead of two.
28.6   A signal, consisting of alternating, different-amplitude low-frequency and high-frequency halves.
28.7   Wavelet Transform of the above alternating-frequency signal. Analysis frequency is 100 Hertz.
28.8   Wavelet Transform of the above alternating-frequency signal. Analysis frequency is 200 Hertz.
28.9   Wavelet diagram of the above alternating-frequency signal for K = 12. Displaying magnitude as per default.
28.10  Wavelet diagram of the above alternating-frequency signal for K = 12. Displaying magnitude squared.
28.11  Wavelet diagram of the above alternating-frequency signal for K = 12. Displaying the square root of the magnitude.
28.12  Wavelet diagram of the above alternating-frequency signal for K = 6.
28.13  Wavelet diagram of the above alternating-frequency signal for K = 24.
28.14  Wavelet diagram of the above alternating-frequency signal for K = 48.
28.15  Chaffinch's song.
28.16  Chaffinch's song, Fourier spectrum (excerpt).
28.17  Chaffinch's song, spectrogram.
28.18  Chaffinch's song, wavelet diagram.
Preface
This is a book about torch, the R interface to PyTorch. PyTorch, as of this writing, is one of the major deep-learning and scientific-computing frameworks, widely used across industries and areas of research. With torch, you get access to its rich functionality directly from R, with no need to install, let alone learn, Python. Though still "young" as a project, torch already has a vibrant community of users and developers; the latter not just extending the core framework, but also building on it in their own packages.

In this text, I'm attempting to attain three goals, corresponding to the book's three major sections.

The first is a thorough introduction to core torch: the basic structures without which nothing would work. Even though, in future work, you'll likely go with higher-level syntactic constructs when possible, it is important to know what it is they take care of, and to have understood the core concepts. What's more, from a practical point of view, you need to be "fluent" in torch to some degree, so you don't have to resort to trial-and-error programming too often.

In the second section, basics explained, we proceed to explore various applications of deep learning, ranging from image recognition, through time series and tabular data, to audio classification. Here, too, the focus is on conceptual explanation. In addition, each chapter presents an approach you can use as a "template" for your own applications. Whenever adequate, I also try to point out the importance of incorporating domain knowledge, as opposed to the not-uncommon "big data, big models, big compute" approach.

The third section is special in that it highlights some of the non-deep-learning things you can do with torch: matrix computations (e.g., various ways of solving linear-regression problems), calculating the Discrete Fourier Transform, and wavelet analysis. Here, more than anywhere else, the conceptual approach is very important to me. Let me explain.

For one, I expect that in terms of educational background, my readers will vary quite a bit. With R being increasingly taught, and used, in the natural sciences, as well as in other areas close to applied mathematics, there will be those who feel they can't benefit much from a conceptual (though formula-guided!) explanation of how, say, the Discrete Fourier Transform works. To others, however, much of this may be uncharted territory, never to be entered if all
goes its normal way. This may hold, for example, for people with a humanist, not-traditionally-empirically-oriented background, such as literature, cultural studies, or the philologies. Of course, chances are that if you're among the latter, you may find my explanations, though concept-focused, still highly (or: too) mathematical. In that case, please rest assured: the road to understanding these things (like many others worth understanding) is a long one; but we have a lifetime.

Secondly, even though deep learning has been "the" paradigm of the last decade, recent developments seem to indicate that interest in mathematical/domain-based foundations is on the rise again (this being a recurring phenomenon). Consider, for example, the Geometric Deep Learning approach, systematically explained in Bronstein et al. (2021), and conceptually introduced in "Beyond alchemy: A first look at geometric deep learning"1. In the future, I assume we'll see more and more "hybrid" approaches that integrate deep-learning techniques and domain knowledge. The Fourier Transform is not going away.

Last but not least, on this topic, let me make clear that, of course, all chapters have torch code. In the case of the Fourier Transform, for example, you'll see not just the official way of computing it, using dedicated functionality, but also various ways of coding the algorithm yourself – in a surprisingly small number of lines, and with highly impressive performance.

This, in a nutshell, is what to expect from the book.

Before I close, there is one thing I absolutely need to say, all the more since, even though I'd have liked to, I did not find occasion to address it much in the book, given the technicality of the content. In our societies, as adoption of machine/deep learning ("AI") grows, so do opportunities for misuse, by governments as well as private organizations. Often, harm may not even be intended; but still, outcomes can be catastrophic, especially for people belonging to minorities, or to groups already at a disadvantage. In this way, the drive to make profits – inevitable in most of today's political systems – results in, at the very least, societies imbued with highly questionable features (think: surveillance, and the "quantification of everything"); and, most likely, in discrimination, unfairness, and severe harm. Here, I cannot do more than draw attention to this problem, point you to an introductory blog post that perhaps you'll find useful, "Starting to think about AI Fairness"2, and ask you to, please, be actively aware of this problem in public life as well as in your own work and applications.

Finally, let me end by saying thank you. There are far too many people to thank for me ever to be sure I haven't left anyone out; so instead, I'll keep this short. I'm extremely grateful to my publisher, CRC Press (first and foremost, David Grubbs and Curtis Hill), for the extraordinarily pleasant
interactions during all of the writing and editing phases. And very special thanks, for their support related to this book as well as their respective roles in the process, go to Daniel Falbel, the creator and maintainer of torch, who reviewed this book in depth and helped me with many technical issues; Tracy Teal, my manager, who supported and encouraged me in every possible way; and Posit (formerly, RStudio), my employer, who lets me do things like this for a living.

Sigrid Keydana

1 https://blogs.rstudio.com/ai/posts/2021-08-26-geometric-deep-learning/
2 https://blogs.rstudio.com/ai/posts/2021-07-15-ai-fairness/
Author Biography
Sigrid Keydana is an Applied Researcher at Posit (formerly RStudio, PBC). She has a background in the humanities, psychology, and information technology, and is passionate about explaining complex matters in a concepts-first, comprehensible way.
Part I
Getting Familiar with Torch
1 Overview
This book has three sections. The second and third will explore various deep learning applications and essential scientific-computation techniques, respectively. Before that, though, in this first part, we are going to learn about torch's basic building blocks: tensors, automatic differentiation, optimizers, and modules.

I'd call this part "torch basics", or, following a common template, "Getting started with torch", were it not for a certain false impression this could create. These are basics, true, but basics in the sense of foundations: Having worked through the next chapters, you'll have a solid conception of how torch works, and you'll have seen enough code to feel comfortable experimenting with the more involved examples encountered in later sections. In other words, you'll be, to some degree, fluent in torch.

In addition, you'll have coded a neural network from scratch – twice, even: One version will involve just raw tensors and their built-in capabilities, while the other will make use of dedicated torch structures that encapsulate, in an object-oriented way, functionality essential to neural network training. As a consequence, you'll be excellently equipped for part two, where we'll see how to apply deep learning to different tasks and domains.
2 On torch, and How to Get It
2.1 In torch World
torch is an R port of PyTorch, one of the two (as of this writing) most-employed deep learning frameworks in industry and research. By its design, it is also an excellent tool to use in various types of scientific computation tasks (a subset of which you'll encounter in the book's final part). It is written entirely in R and C++ (including a bit of C). No Python installation is required to use it.

On the Python (PyTorch) side, the ecosystem appears as a set of concentric circles. In the middle, there's PyTorch itself, the core library without which nothing could work. Surrounding it, we have the inner circle of what could be called framework libraries, dedicated to special types of data (images, sound, text…), or centered on workflow tasks, like deployment. Then, there is the broader ecosystem of add-ons, specializations, and libraries for which PyTorch is a building block or a tool.

On the R side, we have the same "heart" – all depends on core torch – and we do have the same types of libraries; but the categories, the "circles", appear less clearly set off from each other. There are no strict boundaries. There's just a vibrant community of developers, of diverse origin and with diverse goals, working to further develop and extend torch, so it can help more and more people accomplish their various tasks.

With the ecosystem growing so quickly, I'll refrain from naming individual packages – at any time, visit the torch website1 to see a featured subset. There are three packages, though, that I will name here, since they are used in the book: torchvision, torchaudio, and luz. The former two bundle domain-specific transformations, deep learning models, datasets, and utilities for images (incl. video) and audio data, respectively. The third is a high-level, intuitive, nice-to-use interface to torch that lets you define, train, and evaluate a neural network in just a few lines of code. Like torch itself, all three packages can be installed from CRAN.
1 https://torch.mlverse.org/packages/
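For instance, assuming a standard CRAN setup (the exact call is illustrative, not prescriptive), installing all four packages at once could look like this:

    # install core torch, plus the three companion packages used in this book
    install.packages(c("torch", "torchvision", "torchaudio", "luz"))

Note that on first use, torch will download additional binaries (the underlying libtorch library); the next section points to where up-to-date installation details can be found.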
2.2 Installing and Running torch
torch is available for Windows, macOS, and Linux. If you have a compatible GPU, and the necessary NVIDIA software installed, you can benefit from significant speedup – a speedup whose extent will depend on the type of model trained. All examples in this book, though, have been chosen so they can be run on the CPU, without posing taxing demands on your patience.

Due to their often-transient character, I won't elaborate on compatibility issues here in the book; analogously, I'll refrain from listing concrete installation instructions. At any time, you'll find up-to-date information in the installation vignette2; and you're more than welcome, should you encounter problems or have questions, to open an issue in the torch GitHub repository.
2 https://cran.r-project.org/web/packages/torch/vignettes/installation.html
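As a quick sanity check once installation has completed, a minimal sketch like the following can be helpful. (Nothing here is required; and on a CPU-only machine, the last call will simply return FALSE.)

    library(torch)

    # TRUE once the additional backend binaries are in place
    torch_is_installed()

    # create a first tensor, as a smoke test
    torch_tensor(c(1, 2, 3))

    # is a CUDA-capable GPU visible to torch?
    cuda_is_available()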
3 Tensors
3.1 What's in a Tensor?
To do anything useful with torch, you need to know about tensors – not tensors in the math/physics sense, though. In deep learning frameworks such as TensorFlow and (Py-)Torch, tensors are "just" multi-dimensional arrays optimized for fast computation – not on the CPU only, but also on specialized devices such as GPUs and TPUs.

In fact, a torch tensor is like an R array, in that it can be of arbitrary dimensionality. But unlike array, it is designed for fast and scalable execution of mathematical calculations, and you can move it to the GPU. (It also has an extra capability of enormous practical impact – automatic differentiation – but we reserve that for the next chapter.)

Technically, a tensor feels a lot like an R6 object, in that you can access its fields and methods using $-syntax.
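To get a feel for this, here is a quick, minimal illustration; the tensor created below is just an arbitrary stand-in.

    library(torch)

    t <- torch_tensor(1:4)

    # fields are accessed with $
    t$shape   # the tensor's dimension sizes
    t$dtype   # its data type
    t$device  # the device it lives on (the CPU, by default)

    # methods, too, are invoked with $
    t$max()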
Tensors can also be created from existing R data; with real-world datasets, though, a few complications arise. Suppose, for example, a dataset has a column of city names, stored as a factor with many levels; call it locations (the name is just a stand-in). One straightforward way to obtain a tensor is to convert to numeric first:

    locations %>%
      as.numeric() %>%
      torch_tensor() %>%
      print(n = 7)

    torch_tensor
     120
      74
     102
      10
     102
     102
     102
    ... [the output was truncated (use n=-1 to disable)]
    [ CPUFloatType{59855} ]

True, this works well technically. It does, however, reduce information. For example, the first and third locations are "south san francisco" and "san francisco", respectively. Once converted to factors, these are just as distant, semantically, as are "san francisco" and any other location. Again, whether this is of relevance depends on the specifics of the data, as well as on your goal. If you think it does matter, you have a range of options, including, for example, grouping observations by some criterion, or converting to latitude/longitude. These considerations are by no means torch-specific; we just mention them here because they affect the "data ingestion workflow" in torch.

Finally, no excursion into the world of real-life data science is complete without a consideration of NAs. Let's see:

    torch_tensor(c(1, NA, 3))

    torch_tensor
     1
    nan
     3
    [ CPUFloatType{3} ]

R's NA gets converted to NaN. Can you work with that? Some torch functions can. For example, torch_nanquantile() just ignores the NaNs:

    torch_nanquantile(torch_tensor(c(1, NA, 3)), q = 0.5)

    torch_tensor
     2
    [ CPUFloatType{1} ]

However, if you're going to train a neural network, for example, you'll need to think about how to meaningfully replace these missing values first. But that's a topic for a later time.
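Still, to give a first idea of what such a replacement could look like, here is a minimal sketch; it substitutes the NaN-ignoring median, but in practice, the right replacement strategy is entirely domain-dependent.

    t <- torch_tensor(c(1, NA, 3))

    # locate missing entries with torch_isnan(), and substitute the
    # nan-ignoring median computed by torch_nanquantile()
    med <- torch_nanquantile(t, q = 0.5)
    torch_where(torch_isnan(t), med, t)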
3.3 Operations on Tensors
We can perform all the usual mathematical operations on tensors: add, subtract, divide… These operations are available both as functions (starting with torch_) and as methods on objects (invoked with $-syntax). For example, function form and method form of addition are equivalent, as the sketch below shows.
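A minimal illustration, using two arbitrary stand-in tensors:

    library(torch)

    t1 <- torch_tensor(c(1, 2))
    t2 <- torch_tensor(c(3, 4))

    # function form
    torch_add(t1, t2)

    # method form, invoked on the tensor object
    t1$add(t2)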