English Pages 448 [449] Year 2023
Copyright of this book belongs to Arcler.
Introduction to R Programming Language
Mohsen Nady
www.arclerpress.com
Arcler Press 224 Shoreacres Road Burlington, ON L7L 2H2 Canada www.arclerpress.com Email: [email protected] e-book Edition 2022 ISBN: 978-1-77469-224-0 (e-book)
This book contains information obtained from highly regarded resources. Reprinted material sources are indicated and copyright remains with the original owners. Copyright for images and other graphics remains with the original owners as indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data. Authors or Editors or Publishers are not responsible for the accuracy of the information in the published chapters or consequences of their use. The publisher assumes no responsibility for any damage or grievance to the persons or property arising out of the use of any materials, instructions, methods or thoughts in the book. The authors or editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so we may rectify.
Notice: Registered trademarks of products or corporate names are used only for explanation and identification without intent of infringement. © 2022 Arcler Press ISBN: 978-1-77469-039-0 (Hardcover)
Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
ABOUT THE AUTHOR
Mohsen Nady is a pharmacist with an M.D. in Microbiology and a diploma in Industrial Pharmacy. In addition, Mohsen has more than four years of experience using the R programming language. Mohsen has applied his R programming skills to projects in genomics, microbiology, biostatistics, Six Sigma, data analytics, data visualization, app building, geography, market analysis, and business analysis, among others. Mohsen also published his thesis in a high-impact journal, where it attracted many citations; he performed all of the statistical analyses as well as the methodological part. Furthermore, Mohsen has earned additional certificates from top universities (Harvard, Johns Hopkins, Denmark, etc.) in R programming, Python, Excel, and Minitab that highlight his outstanding programming skills.
TABLE OF CONTENTS
List of Abbreviations
Preface
Chapter 1 Installing R and RStudio
1.1 Installing R
1.2 Installing RStudio
Chapter 2 Getting Started with R and RStudio
2.1 The R Console
2.2 RStudio
Chapter 3 Objects and Files
3.1 Working at R Console
3.2 R Objects
3.3 Files and Workspaces
Chapter 4 Vectors and Lists
4.1 Numeric Vectors
4.2 Integer Vectors
4.3 Character Vectors
4.4 Logical Vectors
4.5 Complex Vectors
4.6 Implicit Coercion
4.7 Explicit Coercion
4.8 Lists
Chapter 5 Matrices and Dataframes
5.1 Building Matrices with matrix() Function
5.2 cbind() and rbind() Functions
5.4 data.frame() Function
5.5 Examining the Structure of Built-In R Dataframes
Chapter 6 Factors and Missing Values
6.1 factor() Function
6.2 table() and prop.table() Functions
6.3 cut() Function
6.4 split() Function
6.5 quantile() Function
6.6 Missing Values
Chapter 7 Subsetting Objects
7.1 Subsetting Vectors
7.2 Subsetting Matrices
7.3 Subsetting Lists
7.4 Subsetting Dataframes
7.5 Sorting Objects
7.6 Removing NA Values
Chapter 8 Dates and Times
8.1 Dates
8.3 lubridate Package
8.4 Making Dates from Individual Components
Chapter 9 Importing Data
9.1 Importing Comma Separated Value Files (.csv Extension) into R
9.2 Importing Excel Files (.xls, .xlsx Extensions) into R
9.3 Importing Tab Separated Files (.txt Extension) into R
Chapter 10 Basic Data Wrangling With Tidyverse
10.1 Tidy Datasets
10.2 The “Tidyverse” Package
10.3 dplyr Package
10.4 tidyr Package
Chapter 11 Data Visualization Using ggplot2
11.1 Introduction
11.2 Univariate Analysis: Continuous Data
11.3 Univariate Analysis: Categorical Data
11.4 Bivariate Analysis: Continuous-Continuous Data
11.5 Bivariate Analysis: Continuous-Categorical Data
11.6 Bivariate Analysis: Categorical-Categorical Data
Chapter 12 Functions
12.1 Functions
12.2 Control Structures
12.3 Loop Functions
Bibliography
Index
LIST OF ABBREVIATIONS
CRAN: Comprehensive R Archive Network
EST: Eastern Standard Time Zone
EWR: Newark International Airport
FHR: Fetal Heart Rate
GUI: Graphical User Interface
HTN: Hypertension
IDE: Integrated Development Environment
LGA: LaGuardia Airport
PUMS: Public Use Microdata Samples
UC: Uterine Contraction
PREFACE
This book covers some introductory steps in using the R programming language as a data science tool. The data science field has evolved rapidly in recent years as enormous quantities of data have been generated. To extract value from those data, one needs to be trained in data science skills such as statistical analysis, data cleaning, data visualization, and machine learning. R is now considered the centerpiece language for these tasks because its many packages not only cover all of these skills but also include tools developed by scientists in diverse fields. These fields include, but are not limited to, business, marketing, microbiology, social science, geography, genomics, and environmental science. Furthermore, R is free software and runs on all major platforms: Windows, macOS, and UNIX/Linux.
The first two chapters cover installing and using R and RStudio. RStudio is an IDE (integrated development environment) that makes R easier to use and is more similar to SPSS or Stata. Chapters 3–8 cover the different R objects and how to manipulate them, including the very popular dataframe. Chapter 9 is about importing different files, such as text or Excel files, into your R working session. Chapters 10 and 11 deal with different tidyverse packages that can produce interesting summaries of dataframes, including different types of data visualizations. The last chapter introduces how functions are created in R, along with some control structures and useful functions.
In all these chapters, many examples with code and output are given to help you understand this powerful programming language. I hope this book will be a great addition to your future data analysis projects.
CHAPTER 1
INSTALLING R AND RSTUDIO
CONTENTS
1.1 Installing R
1.2 Installing RStudio
1.1 INSTALLING R
R is a software environment that comes with a graphical user interface (GUI). The R GUI resembles the old DOS console. RStudio is an integrated development environment (IDE) that makes R easier to use and is more similar to SPSS or Stata. It includes a code editor as well as debugging and visualization tools. RStudio is not R, nor does it include R when you download and install it. Therefore, to use RStudio, we first need to install R. Installing R alone is like buying a car, while installing R and RStudio is like buying a car with all its accessories, so both are important.
1.1.1 Steps
1. You can download R from the Comprehensive R Archive Network (CRAN). Search for CRAN in your browser.
2. On the CRAN page, select the version for your operating system.
3. During installation, accept all the defaults. This installs the basic packages you need to get started.
Congratulations! You have installed R.
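As a quick check that the installation worked, the following can be typed at the R console once R starts. This is a minimal sketch, not one of the book's steps; it uses only the base R objects `R.version.string` and `R.version`, the variable names are illustrative, and the printed values depend on the installed version.

```r
# Report which R version is running (the text varies by installation).
version_text <- R.version.string
print(version_text)

# Report the platform this copy of R was built for.
platform_text <- R.version$platform
print(platform_text)
```

If both lines print without an error, R is installed and running.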
1.2 INSTALLING RSTUDIO
1.2.1 Steps
1. You can start by searching for RStudio in your browser.
2. You should find the RStudio website. Once there, click on Download RStudio.
3. This will give you several options. Use the free Desktop version.
4. Once you select this option, it will take you to a page that lists the operating system options. Click the link for your operating system.
5. Once the installation file is downloaded, click on it to start the installation process. It is recommended to accept all the defaults.
Congratulations! You have installed RStudio.
CHAPTER 2
GETTING STARTED WITH R AND RSTUDIO
CONTENTS
2.1 The R Console
2.2 RStudio
R is not a programming language like C or Java; statisticians developed it as an interactive environment for data analysis. This interactivity is an essential feature in data science because the ability to quickly explore data is a necessity for success in this field. You can save your work as scripts containing code that can be executed at any moment. These scripts serve as a record of the analysis you performed, a key feature that facilitates reproducible work. Other attractive features of R are:
• R is free.
• It runs on all major platforms: Windows, macOS, and UNIX/Linux.
• Scripts and data objects can be shared seamlessly across platforms.
• It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, to name a few.
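The add-ons mentioned above are distributed as packages. As a minimal sketch, assuming an internet connection and a reachable CRAN mirror, a package is installed once with `install.packages()` and then loaded in each session with `library()`. The package name `ggplot2` is only an example, and the install line is commented out so that the block also runs offline.

```r
# Install an add-on package from CRAN (needs internet access; run once).
# install.packages("ggplot2")

# Check whether a package is available without attaching it.
# "stats" ships with R, so it is used here as a package that is always present.
has_stats <- requireNamespace("stats", quietly = TRUE)
print(has_stats)

# Attach an installed package for the current session.
library(stats)
```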
2.1 THE R CONSOLE
Interactive data analysis usually occurs on the R console that executes commands as you type them. There are several ways to gain access to an R console. One way is to simply start R on your computer. The console looks something like this:
Note that the console prompt is a greater-than sign (>).
As a quick example, try using the console as a calculator:
> 0.15 * 19.71
[1] 2.9565
The output is 2.9565. The [1] is not part of the result; it is the index of the first value printed on that line, and here the result has only one element.
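A few more calculator-style expressions, offered as a hedged sketch that uses only base R arithmetic. The variable names (`tip`, `tip_rounded`, `result`) are illustrative and do not come from the book.

```r
# Basic arithmetic at the console.
tip <- 0.15 * 19.71            # multiplication
tip_rounded <- round(tip, 2)   # round to two decimal places
print(tip_rounded)

# Operator precedence follows the usual rules: ^ first, then * and /, then + and -.
result <- 2 + 3 * 4^2          # 4^2 = 16, 3 * 16 = 48, 2 + 48 = 50
print(result)

# Even a single number is a vector of length one, which is why R prints [1].
print(length(tip))
```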
2.2 RSTUDIO
When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes three tabs: Environment, History, and Connections, while the bottom pane shows five tabs: Files, Plots, Packages, Help, and Viewer. You can click on each tab to move across the different features.
2.2.1 Writing Code
At the R console, we type expressions. The <- symbol is the assignment operator. Type the following code:
> x <- 1
> print(x)
[1] 1
Or simply type the following code:
> x
[1] 1
# is the comment character. Anything to the right of #, including # itself, is ignored. Comments are used to remind yourself what the code does. Type the following code:
> x <- 1 # anything after # is ignored
> x
[1] 1
The result is the same as above. Note also that in the Environment pane on the right, x is stored and its value is 1.
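The Environment pane has a console counterpart. As a small sketch using base R only, `ls()` lists the objects currently stored, `exists()` checks for a single name, and `rm()` removes an object. The object name `msg` is illustrative and not from the book.

```r
x <- 1            # assign the value 1 to the name x
msg <- "hello"    # character values are assigned the same way

# exists() checks whether a name is defined; ls() lists the defined names.
print(exists("x"))
print(ls())

# rm() removes an object; afterwards the name is no longer defined here.
rm(msg)
print(exists("msg", inherits = FALSE))
```

In RStudio, the Environment pane updates after each of these lines, so the two views can be compared side by side.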
Another example: in the console, assign new values to x and y, and then plot them:
> plot(x, y) # a scatter plot of y versus x
Note that the value of x has changed in the Environment tab in the top right pane. In the bottom right pane, the scatter plot appears. Note the Zoom and Export options: Zoom enlarges the plot, and Export saves it as a .png or .pdf file.
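A runnable version of this pattern is sketched below with made-up values for x and y. The numbers are hypothetical, chosen only for illustration, and are not the book's data. It uses the base-graphics `plot()` function.

```r
# Hypothetical values, chosen only for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Base-graphics scatter plot of y versus x with labeled axes.
plot(x, y, xlab = "x", ylab = "y", main = "y versus x")
```

Both vectors must have the same length, because `plot()` pairs the values by position.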
2.2.2 Scripts
One of the great advantages of R over point-and-click analysis software is that you can save your work as scripts. You can edit and save these scripts using a text editor such as the one in RStudio. To start a new script, click File, then New File, and then R Script.
This opens a new pane on the left, where you can start writing your script.
The next step is to give the script a name. A new script is initially unnamed (shown as untitled); click the save icon in the editor to save it.
When you save the document for the first time, RStudio prompts you for a name. Use a descriptive name with no spaces, only hyphens to separate words, followed by the suffix .R. We will call this script my-first-script.R. Note where it is saved, because we will use that folder as the working directory. The working directory is where you place all your files for R. In the example below, Documents is used as the working directory.
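The working directory can also be inspected and changed from the console. The following is a minimal sketch with the base R functions `getwd()`, `setwd()`, and `source()`. A temporary folder stands in for Documents, and the one-line script content is a made-up placeholder, not the book's script.

```r
# Show the current working directory.
old_dir <- getwd()
print(old_dir)

# Switch to another folder; a temporary folder stands in for Documents here.
work_dir <- tempdir()
setwd(work_dir)

# Create a tiny placeholder script, then run it by name from the working directory.
writeLines("z <- 1 + 1", "my-first-script.R")
source("my-first-script.R")
print(z)

# Restore the original working directory.
setwd(old_dir)
```

Because the script is referred to by file name only, R looks for it in the working directory. This is why it helps to know where the script was saved.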
Now we are ready to start editing our first script. Try writing this code: x