Building reproducible analytical pipelines with R

Build reproducible analytical pipelines to output consistent, high-quality data products using R, Github and Docker. Lea

293 46 9MB

English Pages 522 Year 2023

Table of contents :
Welcome!
How using a few ideas from software engineering can help data scientists, analysts and researchers write reliable code
Preface
1 Introduction
1.1 Who is this book for?
1.2 What is the aim of this book?
1.3 Prerequisites
1.4 What actually is reproducibility?
1.4.1 Using open-source tools to build a RAP is a hard requirement
1.4.2 There are hidden dependencies that can hinder the reproducibility of a project
1.4.3 The requirements of a RAP
1.5 Are there different types of reproducibility?
2 Before we start
2.1 Essential knowledge
3 Project start
3.1 Housing in Luxembourg
3.2 Saving trapped data from Excel
3.3 Analysing the data
3.4 Your project is not done
3.4.1 How easy would it be for someone else to rerun the analysis?
3.4.2 How easy would it be to update the project?
3.4.3 How easy would it be to reuse this code for another project?
3.4.4 What guarantee do we have that the output is stable through time?
3.5 Conclusion
4 Version control with Git
4.1 Installing Git and opening a Github account
4.2 Git superbasics
4.3 Git and Github
4.4 Getting to know Github
4.5 Conclusion
5 Collaborating using Trunk-based development
5.1 Collaborating as a team
5.1.1 TBD basics
5.1.2 Handling conflicts
5.1.3 Make sure you blame the right person
5.1.4 Simplified trunk-based development
5.1.5 Conclusion
5.2 Contributing to public repositories
5.3 Further reading
6 Functional programming
6.1 Introduction
6.1.1 The state of your program
6.1.2 Predictable functions
6.1.3 Referentially transparent and pure functions
6.2 Writing good functions
6.2.1 Functions are first-class objects
6.2.2 Optional arguments
6.2.3 Safe functions
6.2.4 Recursive functions
6.2.5 Anonymous functions
6.2.6 The Unix philosophy applied to R
6.3 Lists: a powerful data-structure
6.3.1 Lists all the way down
6.3.2 Lists can hold many things
6.3.3 Lists as the cure to loops
6.3.4 Data frames
6.4 Functional programming in R
6.4.1 Base capabilities
6.4.2 purrr
6.4.3 withr
6.5 Conclusion
7 Literate programming
7.1 A quick history of literate programming
7.2 {knitr} basics
7.2.1 Set up
7.2.2 Markdown ultrabasics
7.3 Keeping it DRY
7.3.1 Generating R Markdown code from code
7.3.2 Tables in R Markdown documents
7.3.3 Parametrized reports
7.4 Conclusion
8 Conclusion of part 1
9 Rewriting our project
9.1 An Rmd for cleaning the data
9.2 An Rmd for analysing the data
9.3 Conclusion
10 Basic reproducibility: freezing packages
10.1 Recording packages’ version with {renv}
10.1.1 Daily {renv} usage
10.1.2 Collaborating with {renv}
10.1.3 {renv}’s shortcomings
10.2 Becoming an R-cheologist
10.3 Conclusion
11 Packaging your code
11.1 Benefits of packages
11.2 {fusen} quickstart
11.3 Turning our Rmds into a package
11.4 Including datasets
11.5 Installing and sharing the package
11.5.1 Code is hosted
11.5.2 Code cannot be hosted
11.5.3 Marketing your work
11.6 Conclusion
12 Testing your code
12.1 Unit testing
12.2 Assertive programming
12.3 Test-driven development
12.4 Code coverage
12.5 Conclusion
13 Build automation with targets
13.1 Introduction
13.2 {targets} quick-start
13.2.1 _targets.R’s anatomy
13.3 A pipeline is a composition of pure functions
13.4 Handling files
13.5 The dependency graph
13.6 Running the pipeline in parallel
13.7 {targets} and RMarkdown (or Quarto)
13.8 Rewriting our project as a pipeline and {renv} redux
13.9 Some little tips before concluding
13.9.1 Load every target at once
13.9.2 Get metadata information on your pipeline
13.9.3 Make a target (or the whole pipeline) outdated
13.9.4 Customize the network’s visualisation
13.9.5 Use targets from one pipeline in another project
13.9.6 Understanding this cryptic error message
13.10 Conclusion
14 Reproducible analytical pipelines with Docker
14.1 What is Docker?
14.2 A primer on Linux
14.3 First steps with Docker
14.4 The Rocker project
14.5 Dockerizing projects
14.6 Dockerizing development environments
14.6.1 Creating a base image for development
14.6.2 Sharing images through Docker Hub
14.6.3 Sharing a compressed archive of your image
14.7 Some issues of relying on Docker
14.7.1 The problems of relying so much on Docker
14.7.2 Is Docker enough?
14.8 Conclusion
15 Continuous integration and continuous deployment
15.1 CI/CD quickstart for R programmers (and others)
15.2 Running a RAP using Github Actions
15.3 Craft a dockerized dev env with GA
15.4 Run a RAP using a dockerized dev env on GA
15.5 Conclusion
16 Conclusion of part 2
17 The end
“So what?”
References

Author / Uploaded
Bruno Rodrigues

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

Recommend Papers

Reproducible Econometrics Using R 978-0190900663

706 56 46KB Read more

Building ETL Pipelines with Python: Create and deploy enterprise-ready ETL pipelines by employing modern methods [1 ed.] 9781804615256

Develop production-ready ETL pipelines by leveraging Python libraries and deploying them for suitable use cases Key Fea

119 104 6MB Read more

Reproducible Data Science with Pachyderm: Learn how to build version-controlled, end-to-end data pipelines using Pachyderm 2.0 9781801074483, 1801074488

Create scalable and reliable data pipelines easily with Pachyderm Key FeaturesLearn how to build an enterprise-level rep

140 74 8MB Read more

Geohazards and Pipelines: State-of-the-Art Design Using Experimental, Numerical and Analytical Methodologies 9783030498917, 9783030498924

This book presents state-of-the-art methodologies for the design and analysis of buried steel pipelines subjected to sev

390 13 10MB Read more

Introductory Statistics with R

277 100 2MB Read more

Building CI/CD Systems Using Tekton: Develop flexible and powerful CI/CD pipelines using Tekton Pipelines and Triggers 1801078211, 9781801078214

Automate the delivery of applications using Tekton Pipelines and Triggers to deploy new releases quickly and more effici

355 61 4MB Read more

WAIC and WBIC with R Stan: 100 Exercises for Building Logic 9819938376, 9789819938377

Master the art of machine learning and data science by diving into the essence of mathematical logic with this comprehen

109 6 3MB Read more

Building CI/CD Systems Using Tekton: Develop flexible and powerful CI/CD pipelines using Tekton Pipelines and Triggers 1801078211, 9781801078214

Automate the delivery of applications using Tekton Pipelines and Triggers to deploy new releases quickly and more effici

360 80 5MB Read more

$Sparse Estimation with Math and R: 100 Exercises for Building Logic 9789811614460, 9789811614453, 9811614466$

Sparse Estimation with Math and R: 100 Exercises for Building Logic 9789811614460, 9789811614453, 9811614466

157 117 71MB Read more

$Sparse Estimation with Math and R: 100 Exercises for Building Logic [1st ed. 2021] 9811614458, 9789811614453$

Sparse Estimation with Math and R: 100 Exercises for Building Logic [1st ed. 2021] 9811614458, 9789811614453

The most crucial ability for machine learning and data science is mathematical logic for grasping their essence rather t

288 68 5MB Read more