R is very a powerful tool. While it is unquestionably a wonderful thing that everyone can have access to this powerful tool at no monetary cost, now or ever, R’s power can also be intimidating and off-putting at the beginning. People who wish to learn R may simply not know where to begin. Worse still, some people dive in too deep at the beginning — tackling some complex analysis before they have a good handle on the basics — and find things difficult and frustrating, and may then even abandon trying to learn R at all.

To learn R, it is best to learn the fundamentals first. In what follows, we have provided sequence of steps that cover many of these. With each step, we’ll introduce some fundamental concepts or tools, and these concepts and tools can build on one another and be combined to lead to yet more concepts and tools. What we will not cover in these steps are the major topics like data wrangling, data visualization, statistical analyses, etc. These should be tackled later, but they build upon a knowledge of the fundamentals that we cover here.

Step 1: Using the R console

R is a command based system. We type commands, R translates them into machine instructions, which our computer then executes, and then we often, but not necessarily, get back some output. The commands can be typed into the R console, or else they can be put into a script and run as a batch. When learning R, it is usually best to start with typing commands in the R console.

When we open RStudio initially, our console will usually look something like this:

R version 4.1.1 (2021-08-10) -- "Kick Things"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Notice that at the bottom, there is a single line beginning with >. This is the R console’s command prompt, and it is where we type our commands. We then press Enter, and our command is executed. The output of the command, if any, is displayed on the next line or lines, and then a new > prompt appears.

Step 2: Using R as a calculator

A useful way to think about R, and not an inaccurate one either, is that it is simply a calculator. It is useful therefore to learn R by first treating it just like an ordinary handheld calculator, and then learning the ways it extends or goes beyond the capabilities of a calculator. Just as we would with a calculator, we can start using R by doing arithmetic, i.e. adding, subtracting, multiplying, dividing, and so on.

So, let’s start by adding 2 and 2. We do this by typing 2 + 2 at the command prompt, and then pressing Enter. What we see should look exactly1 as follows.

> 2 + 2
#> [1] 4

Notice that the result of the calculation is displayed on the line following the command. That line begins with a [1], and we will return to what this means below but for now it can be ignored.

Now, let’s do some more. Each time, we’ll type the command, press Enter, get the result, and then a new > prompt occurs, and we type the next command and so on. So calculating the sum of 3 and 3, 4 and 5, 10 and 17 and 5 will lead to the following:

> 3 + 3
#> [1] 6
> 10 + 17 + 5
#> [1] 32

Using history: Before we proceed, take a look at our History window in RStudio (click the “History” tab in the upper right panel in RStudio). It should look like this:

2 + 2
3 + 3
10 + 17 + 5

In other words, it is a list of everything we’ve just typed. If we click on any line here, we will select it. If we then click To Console, this line will be copied to the console, and we can press Enter to re-execute. More usefully, we can move through our history with the Up and Down arrow keys on our keyboard. Repeatedly pressing the Up arrow key will bring up the list of each command we just typed, and repeatedly pressing the Down arrow key will bring us back down. At any point as we go through the list, we can press Enter to re-run that command, which then adds that command on to the end of the History.

Let’s look at some more arithmetic. For subtraction, we use the - symbol.

> 5 - 3
#> [1] 2
> 8 - 1
#> [1] 7

Multiplication uses the asterisk symbol *.

> 5 * 5
#> [1] 25
> 2 * 4
#> [1] 8

Division uses the / symbol.

> 1 / 2
#> [1] 0.5
> 2 / 7
#> [1] 0.2857143

Exponents, i.e. raising a number to the power of another number, is accomplished with the caret symbol ^.

> 2 ^ 8
#> [1] 256
> 10 ^ 3
#> [1] 1000

Exponents also work by **, but this is less common.

> 2 ** 8
#> [1] 256
> 10 ** 3
#> [1] 1000

Just like on a calculator, we can combine the +, -, *, /, ^ operators in any commands.

> 2 + 3 - 6 / 2
#> [1] 2
> 10 * 2 - 3 / 5 ^ 2 
#> [1] 19.88

Note that precedence order of operations will be ^ followed by * or / followed by + or -, and just like on a calculator, we can use round brackets to control the order of operations.

> 2 / (3 * 2)                
#> [1] 0.3333333
> (2 / 3) * 2
#> [1] 1.333333

In the above, we were always dealing with positive values. However, we can always precede any number with a - to get its negative value.

> -2 * 3
#> [1] -6
> 10 ^ -3
#> [1] 0.001

Notice that throughout all the above above commands, we put a space between around the +, -, *, /, ^ operators. This is a matter of recommended style to improve readability, not a requirement. In other words, the following all work in exactly the same way.

> 2+3
#> [1] 5
> 2+     3
#> [1] 5
> 2  +   3            
#> [1] 5

It is just recommended to use the 2 + 3 version. On the other hand, while it is also possible to have spaces after the - sign that we use to negate a value, the recommended style is to not do so, and so we use 3 / -2 and not 3 / - 3, though both will do the same thing.

Step 3: Variables and assignment

A major step forward in using R, and a major step beyond the capabilities of a handheld calculator, is the use of variables and the assignment of values to variables. In effect, this simply allows us to store the values of calculations for later use, but that in fact is a very powerful thing.

Consider what happens when we type the following at the command prompt and then press Enter.

> (12 / 3.5) ^ 2 + (1 / 2.5) ^ 3 + (1 + 2 + 3) ^ 0.33
#> [1] 13.6254

All the constituent numbers are stored in our computer’s memory, calculations are done on them, and then the result is stored in memory. This is then displayed as the output on our screen, and because nothing further needs to be done with it, it is removed from memory. We can, however, keep this value stored in memory by assigning it to a variable. We do this with the assignment operator <-, which is a < symbol followed directly by a - symbol. The <- can be typed by key combination Alt+- (i.e Alt key and minus key together) or Option+- on the Mac.

So if we want to assign the value of the above calculation to a variable named x, we would do

> x <- (12 / 3.5) ^ 2 + (1 / 2.5) ^ 3 + (1 + 2 + 3) ^ 0.33            

Notice that on pressing Enter here, there is no output on screen. The calculation is done as normal but rather than outputting it to the screen, the value is assigned to the variable named x. If we now type the name x at the prompt, its value will be displayed.

> x
#> [1] 13.6254

We can then do calculations with this variable just like we would with any other number

> x ^ 2
#> [1] 185.6516
> x * 3.6
#> [1] 49.05145

And we can assign any of these values to new variables.

> y <- x ^ 2
> y <- x * 3.6

In general, the assignment rule is

name <- expression

The expression is any R code that returns some value. This could be the result of a calculation, as in the above examples, but it could also be the “output”, or result, of some complex statistical analysis, which is something we will see repeatedly in later sections and chapters. In the simplest case, it could also be just a single numeric or other value, e.g.,

> x <- 42

The name has to follow certain naming conventions. In all the above examples, we used a single lowercase letter, but in general it can consist of multiple characters. Specifically, it can consist of letters, which can be either lowercase or uppercase, numbers, dots, and underscores. It must, however, begin with a letter or a dot that is not followed by a number. So all of the following are acceptable.

x123
.x
x_y_z
xXx_123

But all of the following are not acceptable.

_x
.2x 
x-y-z

Although many names like x123 etc. are valid names, the recommendation is to use names that are meaningful, relatively short, without dots (using the underscore _ instead for punctuation), and primarily consisting of lowercase characters. Examples like the following are recommended.

> age <- 29
> income <- 38575.65
> is_married <- TRUE
> years_married <- 7

Step 4: Vectors

Up to now, all the values we’ve been dealing with have been single values. In general, we can have variables in R that refer to collections of values. These are known as data structures. There are many different types of data structures in R, but the two that we encounter most often are vectors and data frames. Data frames are probably the single most important data structure in R and are the default form for representing data sets for statistical analyses. But data frames are themselves collections of vectors, and so vectors are a very important fundamental data structure in R. In fact, as we will see, single values like all those used above, are actually vectors when exactly one element.

Vectors are one dimensional sequences of values. While they will often be created for us by the R functions that we use, such as by some data analysis functions, we can also create vectors ourselves using the the c() function. For example, if we want to create a vector of the first 10 primes numbers, we could do the following.

> primes <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)

To use the c() function, where c stands for combine, we simply put within it a set of values with commas between them.

Vector operations

We can now perform operations, like those we saw above, on the primes vector just as we would to a single valued variable. For example, we can do arithmetic operations.

> primes + 1
#>  [1]  3  4  6  8 12 14 18 20 24 30
> primes / 2
#>  [1]  1.0  1.5  2.5  3.5  5.5  6.5  8.5  9.5 11.5 14.5
> primes ^ 2
#>  [1]   4   9  25  49 121 169 289 361 529 841

In these cases, the operations are applied to each element of vector.

Indexing

For any vector, we can refer to individual elements using indexing operations. For example, to get the first or fifth elements of primes, we would use square brackets as following.

> primes[1]
#> [1] 2
> primes[5]
#> [1] 11

If we want to index sets of elements, rather than just individual elements, we can use vectors (made with the c() function) inside the indexing square brackets. For example, if we want to extract the 7th, 5th, and 3rd elements, in that order, we can do the following.

> primes[c(7, 5, 3)]
#> [1] 17 11  5

If we want to refer to a consecutive set of elements, such as the second to the fifth elements, we can do the following.

> primes[2:5]
#> [1]  3  5  7 11

In R, the expression n:m, where n and m are integers, gives us the vector of integers from n to m. So, for example, 2:5 means the same thing as c(2, 3, 4, 5).

If we use a negative valued index, we can refer to or all elements except one. For example, all elements of primes except the first element, or except the second element, can be obtained as follows.

> primes[-1]
#> [1]  3  5  7 11 13 17 19 23 29
> primes[-2]
#> [1]  2  5  7 11 13 17 19 23 29

If we precede a vector of indices by a minus, we’ll return all elements except those in the index vector. For example, we can get all elements except the third, fifth, or seventh elements as follows:

> primes[-c(7, 5, 3)]
#> [1]  2  3  7 13 19 23 29

Single valued vectors

R, unlike some other programming languages, does not represent single values as elementary data types. Single values are in fact just vectors with only one element, and so can be indexed, and so on, just like any other vector.

> x <- 42
> x[1]
#> [1] 42

Character vectors

Our primes vector is a sequences of decimal numbers. We can verify this by applying the class function to the vector and seeing that it is a numeric vector.

> class(primes)
#> [1] "numeric"

We can, however, have many other types of vectors. For example, we can have character strings, i.e. string of characters that are surrounded by quotation marks. These are, in fact, very widely used in R. For example, here’s vector of the names of the six nations.

> nation <- c('ireland', 'england', 'scotland', 'wales', 'france', 'italy')

This vector is of type character, which we can verify as follows.

> class(nation)
#> [1] "character"

Note that we can use single or double quotation marks for each string, as in the following example.

> nation <- c("ireland", 'england', "scotland", 'wales', 'france', 'italy')

Just like numeric vectors, we can index character vectors.

> nation[1]
#> [1] "ireland"
> nation[-2]
#> [1] "ireland"  "scotland" "wales"    "france"   "italy"
> nation[4:6]
#> [1] "wales"  "france" "italy"

We can not, however, perform arithmetic functions on character vectors. We will obtain an error if we try.

> # does not work
> nation + 2
#> Error in nation + 2: non-numeric argument to binary operator
> nation * 2 
#> Error in nation * 2: non-numeric argument to binary operator

Logical vectors

Another widely used type of vector is a logical or Boolean vector. A Boolean variable is a binary variable that takes on values of true or false. In R, these values are represented using “TRUE” or “T” for true, and “FALSE” or “F” for false. For example, a vector of values representing whether each of a set of five people is male or could be as follows.

> is_male <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

This could also be created more succintly as follows.

> is_male <- c(T, F, T, T, F)

Using class, we can verify that this vector is a logical vector.

> class(is_male)
#> [1] "logical"

We can index logical vectors just like numeric or character vectors.

> is_male[2]
#> [1] FALSE
> is_male[c(1, 2)]
#> [1]  TRUE FALSE

Coercing vectors

An important property of all vectors is that they are homogeneous, in that all their elements must be of the same data type. For example, we can’t have a vector with some numbers, some logical values, and some characters. If we try to make a heterogeneous vector like this, some our elements will be coerced into other types. If we, for example, try to combine some logical values with some numbers, the logical values will be coerced into numbers (TRUE will be converted to 1, FALSE will be converted to 0).

> c(TRUE, FALSE, 3, 2, -1, TRUE)
#> [1]  1  0  3  2 -1  1

If we try to combine numbers of logical values with character strings, they will all be coerced into strings, as in the following example.

> c(2.75, 11.3, TRUE, FALSE, 'dog', 'cat')
#> [1] "2.75"  "11.3"  "TRUE"  "FALSE" "dog"   "cat"

Combining vectors

Just as we combined numbers or other single values into vectors using the combine, i.e., c() function, so too can be combine vectors with c(). For example, to combine the primes vector, with the vectors that are the squares and cubes of primes, we would do the following:

> c(primes, primes^2, primes^3)
#>  [1]     2     3     5     7    11    13    17    19    23    29     4     9    25    49   121   169
#> [17]   289   361   529   841     8    27   125   343  1331  2197  4913  6859 12167 24389

In this case we have produced a vector of 30 elements. These don’t all fit on one lines and so are wrapped over three lines. The first row displays elements 1 to 11. The second row begins with the twelfth element, and the third row begins with the 23rd element. Now, we can see the meaning of the [1] on the output that we encountered in all of the above examples. The [1] is merely the index of the first element of the vector shown on the corresponding line.

Missing values

In any vector, regardless of its type, there can be missing values. In R, missing values are denoted by NA, which can be typed inserted explicitly into any vector as in the following examples.

> x <- c(1, 2, NA, 4, 5)
> x
#> [1]  1  2 NA  4  5
> y <- c('be', NA, 'afeard')
> y
#> [1] "be"     NA       "afeard"

The NA here is not a character a string. It is a special symbol with a special meaning in R. In other words, there is an important difference between the following two vectors.

> c(1, 2, NA, 4, 5)    # this is a numeric vector, but with a missing value
#> [1]  1  2 NA  4  5
> c(1, 2, 'NA', 4, 5)  # the "NA" coerces this to be a character vector
#> [1] "1"  "2"  "NA" "4"  "5"

Step 5: Data frames

The data frame is probably the most important data structure in R. As mentioned, data frames are how we almost always represent real world data sets in R, and most statistical analysis commands, especially modern ones, assume data is provided in the form of data frames. We will provide a much more comprehensive coverage to data frames in subsequent chapters, but for now, we’ll describe some of their essential features.

Data frames are rectangular data structures: they have certain number of columns, and each column has the same number of rows. Each column is in fact a vector, and so data frames are essentially collections of equal length vectors.

Usually, data frames are created when read in the contents of a data file, such as a .csv or .xlsx file. However, for the purposes of introduction, we can produce them on the command line with the data.frame() function as in the following example.

> data_df <- data.frame(name = c('billy', 'joe', 'bob'), age = c(21, 29, 23))
> data_df
#>    name age
#> 1 billy  21
#> 2   joe  29
#> 3   bob  23

As we can see, this created a data frame with 3 rows and 2 columns. The columns are variables, and the rows are the observations of these variables.

We can refer to the elements of a data frame in different ways. The simplest is to use double indices, one for the rows, one for the columns. For example, to refer to the element in the third row of the second column

> data_df[3, 2]
#> [1] 23

We can refer to the first and third row of the second column by using a vector of indices for the rows

> data_df[c(1, 3), 2]
#> [1] 21 23

We can refer to all the elements of the third row, by leaving the second index blank.

> data_df[3, ]
#>   name age
#> 3  bob  23

We have more options available if we want to refer to one or more columns. We could, for example, use the double indices, leaving the first index blank to refer to all rows. For example to refer to the second column, we would do the following.

> data_df[,2]
#> [1] 21 29 23

One the other hand, we could also refer to the column by name. To do so, we could use the following $ notation.

> data_df$age
#> [1] 21 29 23

Here, age can be seen as a property of data_df, and so data_df$age accesses this property. An alternative syntax that accomplishes the same thing is to use double square brackets as follows.

> data_df[['age']]
#> [1] 21 29 23

If, one the other hand, we were to use a single square brackets, we would obtain the following.

> data_df['age']
#>   age
#> 1  21
#> 2  29
#> 3  23

This is a subset of data_df that is itself a data frame.

Step 6: Functions

While data structures hold data in R, functions are used to do things to or with the data. In almost all functions, we put data structures in, calculations or done to or using this data, and new data structures, perhaps just a single value, are then returned. So far, we have seen two functions: c(), which combines vectors, and data.frame(), which creates data frames. Altogether, across all R packages and R’s standard library, there are at least tens of thousands of functions available in R, and probably many more. That, of course, is an overwhelmingly large list, but at the start all we need to know is a relatively small set, and we then learn more gradually as we continue to use R.

To explore some functions, we’ll use our primes vector above. We can count the number of elements in the vector as follows.

> length(primes)
#> [1] 10

We can get the sum, mean, median, standard deviation, variance as follows.

> sum(primes)
#> [1] 129
> mean(primes)
#> [1] 12.9
> median(primes)
#> [1] 12
> sd(primes)
#> [1] 9.024042
> var(primes)
#> [1] 81.43333

Nested functions

We can nest functions as in the following example.

> round(sqrt(mean(primes)))
#> [1] 4

Here, we first calculate the mean of primes, then get its square root using sqrt, and then round the result using round.

Optional arguments

In all above the above examples, the functions take a single vector of any length as their input argument, and return a single value (vector of length one) as its output. In some cases, function make take additional arguments, which if left unspecified are given default values. For example, mean() takes an additional trim argument, which trims out a certain proportion of the extreme values of the vector, and then calculates the mean. By default, the value of trim is 0, meaning that no trimming is done by default. If, on the other hand, we want to trim out 10% of the highest and lowest values, we would set trim to 0.1 as follows.

> mean(primes, trim=0.1)
#> [1] 12.25

Function help pages

Knowing that mean() takes an additional optional argument trim, and what trim does and what is its default values, can be discovered from the help page for mean(). In fact, all functions will have a help page, which will usually tell us exactly what should go in to the function and what comes out, what the function does, and it usually also provides examples of its use. We can always access the help page of a function by preceding the function’s name with ? and pressing enter.

> ?mean

This is equivalent to typing help() with function’s name inside.

> help("mean")

Another alternative is to press the F1 key when our cursor is on the function’s name on the console.

Step 7: Scripts

Scripts are files where we write R commands, which can be then saved for later use. We can bring up RStudio’s script editor with Ctrl+Shift+N, or go to the File > New File > R script, or click on the icon on the left of the taskbar below the menu and choose R script.

Executing code in scripts

In the script in the editor, we can write any R commands that could be written and executed in the console. Unlike in the console, when we press Enter after a command we type in the editor, the command is not executed. All Enter does is move the cursor on to the next line in the script, just as would happen in any text or code editor. If we want to run a command, place the cursor anywhere on the line, and click the Run icon. This will effectively copy the line to the console and run it there. we can achieve the exact same effect by pressing Ctrl+Enter. That is, we place the cursor anywhere on line, press Ctrl+Enter, and it will be copied to the console and executed.

In a script, we can have as many lines of code as we wish, and there can be as many blank lines as we wish. For example, our script could consist of the following lines.

composites <- c(4, 6, 8, 9, 10, 12, 14, 15, 16, 18)

composites_plus_one <- composites + 1

composites_minus_one <- composites - 1

If we place the cursor on line 1, we can then click the Run icon, or press the Ctrl+Enter keys. The line is then copied to the console and executed, and then the cursor jumps to the next line of code (line 3). We can then click Run or press Ctrl+Enter again, which copies line 3 to the console and executes it, and then the cursor moves to line 5. This way, we can step through a long list of commands, executing them one by one, by just repeatedly click the Run icon, or repeatedly pressing Ctrl+Enter.

In addition to running the commands in our script line by line, we can select multiple lines with our mouse and again press the Run icon, or press Ctrl+Enter.

We may also run the entire script at once. This is known as sourcing the script. We may accomplish this by pressing the Source icon on the left of the editor’s task bar. This then runs a source() command on the console with the name (or temporary name) of our script.

Multiline commands

One reason why writings in scripts is very practically valuable, even if we don’t wish to save the scripts, is when we are write long and complex commands. For example, let’s say we are creating a data frame more than a few variables and more than a few observations per variable. In this case, we can split the command over multiple lines as in the following example.

Df <- data.frame(name = c('jane', 'joe', 'billy', 'bob', 'jim'),
                 age = c(23, 27, 24, 32, 19),
                 sex = c('female', 'male', 'male', 'male', 'male'),
                 occupation = c('doctor', 'tinker', 'tailor', 'soldier', 'spy')
)

We can execute this command as if it were on a single line by placing the cursor anywhere on any line and pressing the Run icon, or repeatedly pressing Ctrl+Enter.

When writing multiline commands, it is advisable that we use consistent indentation to increase readability. The following multiline command will work perfectly, but it is more difficult to read, and perhaps to edit too.

Df <- data.frame(name = c('jane', 'joe', 'billy', 'bob', 'jim'),
age = c(23, 27, 24, 32, 19),  
                             sex = c('female','male', 'male', 'male', 'male'),
           occupation = c('doctor', 'tinker', 'tailor', 'soldier', 'spy')
)

We can always get improved and consistent indentation by highlighting all our lines and then going to the Code menu, and selecting Reindent Lines. More efficiently, we can accomplish the same thing by highlighting the lines and then pressing Ctrl+I.

Comments

An almost universal feature of programming language is the option to write comments in the code files. A comment allows us write to notes or comments around the code that is then skipped over when the script or the code lines are being executed. This is particularly useful to write explanatory notes to ourself or to others. In R, anything following the # symbol on any line is treated as a comment. Any line starting with # will be ignored completely when the code is being run, and we can place a # at any point on a line, and anything written after it is also ignored. The following code shows some examples of the use of comments.

# Here is a data frame with four variables.
# The variables are name, age, sex, occupation.
Df <- data.frame(name = c('jane', 'joe', 'billy', 'bob', 'jim'),
                 # This line is a comment too.
                 age = c(23, 27, 24, 32, 19), # Another comment. 
                 sex = c('female','male', 'male', 'male', 'male'),
                 occupation = c('doctor', 'tinker', 'tailor', 'soldier', 'spy')
)

Code sections

We can use comments to divide up a script into sections as in the following example.

# Create vectors ----------------------------------------------------------

primes <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
composites <- c(4, 6, 8, 9, 10, 12, 14, 15, 16, 18)


# Some calculations -------------------------------------------------------

primes_plus_one <- primes + 1
composites_squared <- composites ^ 2


# Some more calculations --------------------------------------------------

primes_lt_20 <- primes[primes < 20] # prime numbers less than 20

In this example, we have created three code sections, each one defined by the section header that begins with # and ends with a sequence of dashes ----. We can always create a new section with Code -> Insert Section, or the key command Control+Shift+R. In general, however, for the RStudio editor, a code section is demarcated by any line beginning with # and any with at least four dashes ----, equal signs ====, or hashes ####.

On the one hand, a section is nothing but a set of lines preceded by a commented line that read like a section header. This, of course, helps to make the code readable. However, RStudio recognizes these sections as distinct units of the script. For starters, we can jump between sections by selecting the name of the section listed when we press the icon at the bottom left of the editor. We can also bring up the list of sections by Alt+Shift+J. In addition, the icon in the top right corner of the editor provides a list of our sections, which we can jump to by clicking. For short scripts, jumping between sections does not offer much in terms of efficiency, but this feature is particularly useful for moving between sections in longer scripts.

We may also collapse or expand any of our sections. For example, if our cursor is within the Some calculations section above, you can do Edit > Folding > Collapse to hide the contents of this section. This can also be accomplished with the keyboard shortcut Alt+L. We can expand or uncollapse the section with Edit > Folding > Expand, which can also be done with Alt+Shift+L. We can collapse all sections simultaneously with Edit > Folding > Collapse All, or Alt+O, and expand all sections simultaneously with Edit > Folding > Expand All, Alt+Shift+O.

Another particularly useful feature of code sections is we may also run all the code in a section with Code > Run Region > Run Code Section, or else with Control+Alt+T shortcut.

Saving a script

We save a script just like we would save any other file. On the editor, there’s a button at the top left of the editor window. Likewise, we can do File > Save, which is mapped to the Ctrl+S keyboard shortcut. The resulting file is just a text file, and not some special format that can only be read by R or RStudio. In can be opened in any of the countless programs that can open text files. In principle, it can be named anything we like. However, it is recommended that we only use lower case letters, numbers, and underscores — so no spaces or hyphens or dots — and that it should always end with the file extension .R. The .R file extension is automatically included if it is not already there when we save a script using RStudio.

Step 8: Installing and loading packages

There are over 18000 add-on R packages as of October, 2021. These are installed on demand, just as one installs apps on a mobile device. The easiest way to install packages is from the Packages pane, which is in the lower right window, see next figure.

There we will see a listing of all the R packages that are currently installed on our system. In the top right of the window, there is the button. This refreshes the listing to make sure that any very recently installed packages are also shown in the listing. The packages are always listed alphabetically, so we can browse through the list to see if what we want is already there. There is also a search box, which is in the top right. Start typing the name of any package there and the package listing will be filtered to match what we type.

To install a package, click the Install button on the top left. This will bring up a dialog box, shown in next figure, where we can type in the packages that we want. As soon as we start typing in the box for the package names, packages matching the characters we’ve typed will be shown. From this, we can select what we want. We can type the names of multiple packages into the install packages dialog box, separating them with a space or a comma. When we click Install, we will notice that commands are run in our console. For example, if we install the package dplyr, we will see the following command appear in our console, which will then automatically execute.

> install.packages("dplyr")

If we select multiple packages to install — for example, dplyr, tidyr, and ggplot2 — then the same install.packages command is run, but with a vector of names as the input argument.

> install.packages(c("dplyr", "tidyr", "ggplot2"))

Given that the install packages dialog box just runs these commands, we can always type them directly into the console ourself.

As soon as these commands are executed, we will see the progress of the installation in the R console. There, in the top left, a will be shown when the installation is occurring and we will see activity happening in the console. When the installation is complete, the stop sign will disappear and the normal > command prompt will return to the console.

Which packages to install?

A common question, especially when learning R initially, concerns which packages should we install? There is no general answer to this question because it heavily depends on the type of analyses we need to do or the type of data we will be working with. However, we highly recommend always installing the tidyverse collection of packages. The tidyverse describes itself as “an opinionated collection of R packages designed for data science … (that) share an underlying design philosophy, grammar, and data structures.” We will use these packages extensively from now on in this book.

Although tidyverse is a collection of packages, and these packages can be individually installed, they can also be installed en-masse by typing tidyverse into the install packages dialog box, or alternatively by typing the following in the console.

> install.packages('tidyverse')

Updating packages

R packages are downloaded from the Comprehensive R Archive Network (CRAN), which is a network of mirror servers that serve out R software and are distributed around the world. We can always check for any updates to our installed packages using the Update button at the top of our Packages window. This will list any packages with versions that are more recent that those you have installed. We can then update some or all of these packages. It is generally a good idea to periodically check for updates and update all packages if necessary.

Loading packages

Having installed any package, it is now available for use in any R session. However, it needs to be activated or loaded for this to happen. In other words, our installed packages can only be used when they are loaded into our R session. We can load them by clicking on the tick box next to the package name in the Packages listing. Unticking the package will deactivate or unload it. We will notice that when to tick or untick this box, a library() command is run on the console. This can always be typed directly, rather than clicking the tick box. For example, to load all the tidyverse packages, do

> library("tidyverse")

Throughout the remainder of this book, we will be using many packages, and we will always load them from a library() command run on the console.

Package masking

When we load a package, we are often greeted with messages like the following.

> library("dplyr")
Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

What this means that the dplyr package has loaded up two functions, filter and lag, that were previously loaded by the stats package, and four other functions, intersect, setdiff, setequal, and union, that were loaded previously by the base package. Both the stats and the base packages are loaded by default in any R session. If we now try to use, for example, filter, we will be given filter function of dplyr rather than the stats package’s filter function. In general, R searches for functions according to its search path, which we can see with the following command.

> search()
#>  [1] ".GlobalEnv"        "package:forcats"   "package:purrr"     "package:tidyr"    
#>  [5] "package:tibble"    "package:tidyverse" "package:ggplot2"   "package:readr"    
#>  [9] "package:stringr"   "package:dplyr"     "tools:rstudio"     "package:stats"    
#> [13] "package:graphics"  "package:grDevices" "package:utils"     "package:datasets" 
#> [17] "package:methods"   "Autoloads"         "package:base"

So, for example, if we search for a function named filter, we will find the filter in dplyr first, because dplyr is nearer the start of the search path than the stats package. This does not mean that the masked functions are unavailable. We can access them directly using the <package name>::<function name> syntax as in the following examples.

> stats::filter()      # The `filter` function in the `stats` package.
> base::intersect()    # The `intersect` function in the `base` package.

Step 9: Reading in and viewing data

We mentioned that when we are dealing with real data sets in R, we almost always read them in to R from files. R allows us to import data from a very large variety of data file types, including from other statistics programs like SPSS, Stata, SAS, Minitab, and so on, and common file formats like .xlsx and .csv. Here, we’ll just look at the example of importing .csv files, which is a particularly common scenario anyway, and the procedure we follow for importing the other file formats is similar.

When learning R initially, the easiest way to import data is using the Import Dataset button in the Environment pane. This will give us options to import from a number of different file types. If we are importing .csv, we can choose either From Text (base)… or From Text (readr)…. The former option imports the .csv file using the function read.csv(), which part of the base R set of packages. The latter option uses the read_csv() function from the readr package, which is part of the tidyverse package collection. Throughout this book, we recommend always using tidyverse over base R functions whenever the options are available, and so we will use the From Text (readr)… option always for importing .csv files.

Upon choosing the From Text (readr)… option, if the readr package is not yet installed, we will be asked if to install it. However, it will already be available for us if we installed tidyverse, as described in the previous step. We will then be given a typical file import dialog box, just as we would encounter in many other programs, see next figure. There we choose the file we want to use using a file browser. At the bottom left of the dialog box, there are various options about how to parse the .csv file. Often we can use leave these at their default values. However, it is often a good idea to explicitly choose a new Name for the data frame into which the contents of the .csv will be read. By default, the name of the data frame will be the file name, but often it is preferable to have a shorter name. In this book, we usually end the name of the data frame with _df, e.g. data_df, scores_df, etc. The _df ending makes it clear that the object is a data frame.

We will also notice that in bottom right, there is a Code Preview of the code that will be run in console once we click the Import button. What this shows us is that the Import Dataset dialog box is just a means to create a few lines of code that will then be run in the console. We can always write this code, or at least its essential parts, explicitly in the console or in a script. In fact, we strongly advise doing so because it is ultimately much quicker and efficient than going through a GUI dialog box. However, to do this in a re-usable way, and without having to remember or type long file paths, it is necessary to first understand the concept of the R session’s working directory, which we will deal with below.

Let’s say the .csv file that we wish to import is named weight.csv and exists in our personal Downloads directory on a Windows device. If we choose the name weight_df for the imported data, rather than the default, we should see the following code and output in the Console.

library(readr)
weight_df <- read_csv("C:/Users/andrews/Downloads/weight.csv")
View(weight_df)
#> Parsed with column specification:
#> cols(
#> subjectid = col_double(),
#> gender = col_character(),
#> height = col_double(),
#> height_selfreport = col_double(),
#> weight = col_double(),
#> weight_selfreport = col_double(),
#> age = col_double(),
#> race = col_double()
#> )

The messages displayed after the read_csv command is run (i.e., Parsed with column specification ...) tells us how many variables have been created in the resulting data frame, and of which data types these variables are. As can be seen we have 8 variables — subjectid, genderrace — and corresponding to each one is either col_double() or, in the case of gender, col_character(). The col_double indicates that the corresponding variable is a normal numerical variable (“double” refers to double floating point number, which is a decimal number represented by 8 bytes of memory). The col_characeter indicates that the gender variable is a character vector.

The last line of the code that was run when import using the Import Dataset is View(Df). This brings up a data frame viewer that looks like a spreadsheet. With this, we may obviously browse through the variables and observations, we may sort according to any variable, and using the Filter icon, we can filter the data set according to selected ranges of values of the variables. Although the View is certainly a useful tool, we can in fact view, sort, select, filter, etc., the data much more efficiently using command line tools like dplyr. This is something to which we will return in depth in the next chapter, under the topic of data wrangling. For now, we will just provide an introduction to viewing the data in the data frame.

Certainly the easiest way to look at a data frame in R is simply to type its name in the console as follows.

> weight_df
#> # A tibble: 6,068 × 8
#>    subjectid gender height height_selfreport weight weight_selfreport   age  race
#>        <dbl> <chr>   <dbl>             <dbl>  <dbl>             <dbl> <dbl> <dbl>
#>  1     10027 Male    177.6            180.34   81.5            81.670    41     1
#>  2     10032 Male    170.2            172.72   72.6            72.595    35     1
#>  3     10033 Male    173.5            172.72   92.9            93.013    42     2
#>  4     10092 Male    165.5            167.64   79.4            79.401    31     1
#>  5     10093 Male    191.4            195.58   94.6            96.642    21     2
#>  6     10115 Male    172              175.26   80.2            79.401    39     1
#>  7     10117 Male    181              182.88  116.2           113.43     32     2
#>  8     10237 Male    185              187.96   95.4            95.735    23     1
#>  9     10242 Male    177.7            177.8    99.5            99.819    36     1
#> 10     10244 Male    181.1            182.88   70.2            72.595    23     1
#> # … with 6,058 more rows

From this, we see a lot. We see that there are 6068 rows and 8 variables. We see the first 10 values of the first seven of the variables, and see that there is also one additional variable, race, whose values are not shown. We also see the corresponding data type of each of the variables. At the top of the listing, we see that the data frame is a tibble. A tibble is simply a tidyverse flavoured data frame. It is a regular data frame but with some relatively minor additional features.

The number of observations, variables, and even the number of significant digits that are shown can be controlled by setting options. For example, by default, 10 observations will be shown. If we want, say, up to 15 observations to be shown by default, we can do

> options(tibble.print_max = 15, tibble.print_min = 15)

The tibble.print_max and tibble.print_min can be different. For example, we can set them as follows.

> options(tibble.print_max = 15, tibble.print_min = 5)

In this case, any tibble with up to 15 observations will be shown in full, but if it has over 15 observations, then only the first 5 observations will be shown. The number of displayed variables, on the other hand, is set with the tibble.width option. If we always wish to see all rows of all tibble data frames, we can set the following options.

> options(tibble.print_max = Inf)

For an individual tibble data frame such as weight_df, we can show all rows as follows.

> print(weight_df, n = Inf)

By default, only the number of columns that can fit on the console will be displayed. This means, of course, that when we widen our console, we may always fit in more variables, and if we’re using a narrower console, fewer variables are shown. If we always want all variables to be displayed, wrapping them if necessary to fit, then set tibble.width to Inf.

> options(tibble.width = Inf)

We may set tibble.print_max, tibble.print_min, tibble.width back to their default behaviour as follows.

> options(tibble.print_max = NULL, tibble.print_min = NULL, tibble.width = NULL)

We can control the number of significant digits that are displayed by the pillar.sigfig option.

> options(pillar.sigfig = 5)

Step 10: Working directory, RStudio projects, and clean workspaces

Every R session has a working directory, which we can think of as the directory (aka folder) on the computer’s filesystem in which the R session is running. In other words, we can think of the R session as belonging to, or running inside, some directory somewhere on the system. By default, this is often the user’s home directory. We can always list our session’s working directory with getwd(). On Windows, this will appear something like the following.

> getwd()
> #> [1] "C:/Users/andrews"

On Macs, it will appear something like the following.

> getwd()
> #> /Users/andrews/

The practical importance of the working directory is that it is the default location from which files are read and to which files are written whenever we are using relative rather than absolute path names. The use of relative paths is necessary to allow our R scripts to be usable either by others or by ourselves on different devices.

An absolute path name gives the exact location of the file on the file system by specifying the file’s name, the directory it is in, the directory in which its directory is located, and so on, up to the root of the filesystem. For example, on a Windows machine, an absolute path name for a file might be C:/Users/andrews/Downloads/data/data.csv. From this, we see that the file data.csv is a directory named data that is in Downloads in andrews in Users on the C drive. We could read this file into R with the following command.

data_df <- read_csv("C:/Users/andrews/Downloads/data/data.csv")

This command will always work (assuming the file and directories exist) regardless of what the R session’s working directory is. A relative path name, on the other hand, might be data/data.csv. This specifies that data.csv is in a directory called data, but it does not explicitly tell us the location of data. In these situations, the location is always assumed to be the R session’s working directory. Consider the following commands.

data_1_df <- read_csv("data.csv")
data_2_df <- read_csv("data/data.csv")

The first would read in the file data.csv from the working directory, while the second would read the file data.csv from the data sub-directory in the working directory. Likewise, the command

write_csv(data_df, 'my_new_data.csv')

will write the data to the file my_new_data.csv in the current working directory.

In general, R scripts that use absolute path names are not usable either by others or by ourselves on different machines. For example, if I had the following script, it would only be usable by me on a particular device.

data_df <- read_csv("C:/Users/andrews/Downloads/data/data.csv")

some_function(data_df)

If I wanted to share this code with others, or use it myself on another device, then it would be better to use relative paths as in the following examples.

data_df <- read_csv("data.csv")

some_function(data_df)

Assuming that data.csv is the session’s working directory, this code will now work everywhere for anyone.

The working directory for the R session can always be changed by going to Session > Set Working Directory > Choose Directory, which can be obtained by the key command Ctrl+Shift+H. This allows us to set the working directory to be any directory of our choice.

RStudio projects

When we are working on a particular data analysis task, one that might involve multiple interrelated data files and scripts, it is usually a good idea to set the working directory to be the directory where all these scripts and data files are located. For example, let’s say the directory where we kept our scripts and data was data_analysis and was organized as follows.

data_analysis/
 -- script_1.R
 -- script_2.R
 -- data/ 
      -- data_1.csv
      -- data_2.csv

In this case, it would be advisable to set the working directory to be the location of data_analysis whenever we are working on these files. Inside the script_1.R or script_2.R scripts, we would then be able to include commands code such as the following.

data_1_df <- read_csv('data/data_1.csv')

Having set the working directory to the location of data_analysis, whenever these scripts are run, they would always refer to the appropriate data files.

To facilitate this, RStudio allows us to set our data_analysis directory as an RStudio project. To do so, we go to File > New Project and choose Existing Directory and then choose the data_analysis directory. Likewise, if we have other directories with their own sets of inter-related files and scripts, we can set these as separate projects too. Having set up these projects, whenever we start RStudio, we can open a project, and then switch between projects. Whenever we open a project, or switch between projects, our working directory is then automatically switched to the project’s directory. As such, our working will always be set to the appropriate directories for whichever files we are working on and it need never be set manually.

RStudio projects also allow us to have project specific settings for our command history. In other words, all the commands used whenever we are working on a particular project will be saved to a history file specific to that project which can be loaded automatically every time we open or switch to that project. We can ensure that our history is always saved by set the Always save history option in Tools > Project Options.

Clean workspaces

In general, R allows us to always save the contents of our workspace to a file, usually named .RData, when terminating our session. The contents of our workspace is the collection of objects, e.g., data sets and other variables we’ve created, that exists in our R session. This collection can be viewed in the Environment window. If we save these objects to file, they can be automatically loaded the next time we start R.

This saving and loading of the workspace may seem like a very useful feature. For example, if we’ve been working for a long session, perhaps creating multiple new data frames and other data structures, it may seem particularly useful to be able to save them all to file, and then reload them back in to our session when we start again. However, despite appearance, this may be counterproductive. We may end up with a cluttered workspace that is full of objects, many of which were intended to be only temporary. Even of those we do wish to keep, we may not be sure where they came from or how they were created. It is much better practice to always write code that creates the variables and data structures that we need. For example, rather than saving many data frames that we have derived from processing some raw data that we initially imported, it is far better to have a single script that reads the raw data and then runs a series of commands on this data, thus creating, or re-created, all the derived data frames.

RStudio projects also give us the option of always saving or never saving our workspace, and always or never loading the workspace file at startup. These options are available in Tools > Project Options. We recommend that we always set these options to never save nor load the contents of the workspace. This ensures that we always have, or can have, a clean workspace whose contents we are in control of.

At any point during our RStudio session, we can restart R itself through Session > Restart R. Doing so prior to running any script is good practice because it ensures that our script will not have any hidden dependencies, and that any data structures or functions etc that it does depend on are created from are contained in the code itself.


  1. In this document, each line of the console’s output begins with #>, which will not appear in the actual console. We use the #> just to make it easier to read and to distinguish between the commands that we input into the console and the output that is then displayed.↩︎