Introduction to R

R is an interpreted language, which means it can be used interactively: we can type some code, hit execute and see the results instantly. This is different from compiled languages such as C or C++ where you must first compile code into an executable file, then run it.

R as simple calculator

In [7]:
2 + 3^5
245
In [8]:
7 %/% 3
2
In [9]:
sin(0.5*pi)
1
In [10]:
1.23e3
1230

Printing results

By default, the result on the last line is displayed in the Jupyter notebook. To see other lines, use the print function.

In [11]:
print(1 * 2 * 3 *4)
print(11 %% 3 == 1)
10 > 1
[1] 24
[1] FALSE
TRUE

Getting help

Type something and hit the tab key - this will either complete the R command for you (if unique) or present a list of opitons.

Use ?topic to get a description of topic, or more verbosely, you can also use help(topic).

In [12]:
?seq

Use example to see usage examples.

In [13]:
example(seq)

seq> seq(0, 1, length.out = 11)
 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

seq> seq(stats::rnorm(20)) # effectively 'along'
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

seq> seq(1, 9, by = 2)     # matches 'end'
[1] 1 3 5 7 9

seq> seq(1, 9, by = pi)    # stays below 'end'
[1] 1.000000 4.141593 7.283185

seq> seq(1, 6, by = 3)
[1] 1 4

seq> seq(1.575, 5.125, by = 0.05)
 [1] 1.575 1.625 1.675 1.725 1.775 1.825 1.875 1.925 1.975 2.025 2.075 2.125
[13] 2.175 2.225 2.275 2.325 2.375 2.425 2.475 2.525 2.575 2.625 2.675 2.725
[25] 2.775 2.825 2.875 2.925 2.975 3.025 3.075 3.125 3.175 3.225 3.275 3.325
[37] 3.375 3.425 3.475 3.525 3.575 3.625 3.675 3.725 3.775 3.825 3.875 3.925
[49] 3.975 4.025 4.075 4.125 4.175 4.225 4.275 4.325 4.375 4.425 4.475 4.525
[61] 4.575 4.625 4.675 4.725 4.775 4.825 4.875 4.925 4.975 5.025 5.075 5.125

seq> seq(17) # same as 1:17, or even better seq_len(17)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

To search for all functions with the phrase topic, use apropos(topic).

In [14]:
apropos("seq")
  1. 'seq'
  2. 'seq_along'
  3. 'seq_len'
  4. 'seq.Date'
  5. 'seq.default'
  6. 'seq.int'
  7. 'seq.POSIXt'
  8. 'sequence'

Finally, you can always use a search engine - e.g. Googling for “how to generate random numbers from a normal distribution in R” returns:

R Learning Module: Probabilities and Distributions www.ats.ucla.edu › stat › r › modules University of California, Los Angeles Generating random samples from a normal distribution ... (For more information on the random number generator used in R please refer to the help pages for the ... R: The Normal Distribution https://stat.ethz.ch/R-manual/R-devel/library/.../Normal.html ETH Zurich Normal {stats}, R Documentation ... quantile function and random generation for the normal distribution with mean equal to mean and ... number of observations. CASt R: Random numbers astrostatistics.psu.edu/.../R/Random.html Pennsylvania State University It is often necessary to simulate random numbers in R. There are many functions available to ... Let's consider the well-known normal distribution as an example: ?Normal .... Creating functions in R, as illustrated above, is a common procedure.

Creating and using sequences

In [15]:
1:4
  1. 1
  2. 2
  3. 3
  4. 4
In [16]:
seq(2, 12, 3)
  1. 2
  2. 5
  3. 8
  4. 11
In [17]:
seq(from = 0, to = pi, by = pi/4)
  1. 0
  2. 0.785398163397448
  3. 1.5707963267949
  4. 2.35619449019234
  5. 3.14159265358979
In [18]:
seq(from = 0, to = pi, length.out = 5)
  1. 0
  2. 0.785398163397448
  3. 1.5707963267949
  4. 2.35619449019234
  5. 3.14159265358979

Basic Programming Concepts

The following sections introduce some basic programming concepts within the context of R, but they are common ingredients in any programming language

- Style
- Variable Assignment
- Data Types
- Operators
- Looping and branching
- Errors and debugging

Style

Programming style (how to name things, how much blank space to use, etc.) is an important part of creating a readable program. It’s beyond our scope to cover this in detail, but it would be a good idea to review an R style guide if you plan to write a lot of R code. Different people have slightly different conventions, but this guide by Hadley Wickham is pretty widely used:

http://adv-r.had.co.nz/Style.html

It’s ‘based’ on Google’s R style guide, but some of the recommendations are in direct conflict. The most important thing is to try to be consistent.

Variable Assignment

Assigning to variables

Variable assignments in R are made using the <- operator, but note that there is more than one assignment operator. Later, when we discuss custom functions in greater detail, we’ll talk about the “=” assignment operator. Note that while some programmers use “=” in place of “<-” (they are almost equivalent), it is considered best practice to use “<-” for assignments in R.

In [20]:
x <- 3
y <- 5

z <- x * y

z
15

Data Types

When we assign variables in R (and in any programming language), they are assigned a ‘type’ (implicitly or explicitly). Some common types are:

  • numeric

In R, this includes both floating points (decimal) and integers (e.g. 1.2, 3.14159, 15)

  • integer

Non-floating point number (e.g. 1, 2, -4, etc., but not 1.2, 3.14159,…)

  • logical

TRUE or FALSE

  • character

Words and letters. NB: numbers can be treated as characters too - we’ll do some examples that illustrate that.

In R, these are also ‘classes’, and the distinction between ‘type’ and ‘class’ is a bit abstract and won’t really help us with basic programming, so using a gross and somewhat incorrect (but workable) simplification, we’ll just refer to the above as classes. New classes can be created from these basic classes:

  • vector

This is actually the same as ‘numeric’. R sees numbers as vectors of length 1.

  • matrix

A 2-dimensional array of numerics.

  • list

Like a vector, but elements can be other than numeric (such as a collection of logicals, character strings, etc.)

  • dataframe

A 2-dimensional array that allows columns to be different types.

  • tibble

New version of dataframe. We’ll talk more about this in future lectures.

Operators

Once we have variables of different types or classes, we usually want to do something with them. That can mean anything from simple arithmetic to summaries to custom made R functions. Here, we will discuss simple operators.

Arithmetic operators
  • *, + , - , /

These all do what you expect to numbers (numeric types), but things get a little tricky when we have vectors and matrices. To be explicit, the above operators are multiplication, addition, subtraction and division, respectively.

  • %*%

This is what is known as the ‘dot’ product for vectors and matrices.

  • %%

This is the modulus operator. (When you divide an integer by another integer it returns the remainder)

  • %/%

This is the integer division operator. It returns the result of a division without the remainder.

Logical Operators
  • >, <, ==, !=, >=, <=

The above compare variables and return a ‘logical’ result (TRUE or FALSE)

  • &, |

The above are ‘AND’ and ‘OR’ logical operators, respectively. We can combine the previous operators with these to form complex logical queries.

Examples

The only way to learn R is by using R. The following exercises require a bit more knowledge than the notes describe. Remember to use TAB and ? and apropos whenever you’re stuck.

  1. Using arithmetic operators on numbers.

    a. If you put USD 100 in a bank that pays 5 percent interest compounded daily, how much will you have in the bank at the end of one year? Recall from high school math that
    
    \[ \begin{align}\begin{aligned} A = P(1 + r/n)^m\\where :math:`A` is the amount, :math:`P` is the principal, :math:`r`\end{aligned}\end{align} \]

    is the interest rate, \(n\) is the compoundings per period and \(m\) is the number of periods.

    b. What is the (Euclidean) distance between two points at (1,5) and (4,8)? Recall that the Euclidean distance between two points (x1, y1) and (x2, y2) is given by the Pythagorean theorem:
    
    \[\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\]

  2. Assigning variables Write the formula in part 1a using variables (i.e. P,r,n,m) and then assign those variables to the values above. Assign the result to the variable A, and then print A.

  3. Data Types Define two variables of the following classes: numeric, character, logical and integer by assigning names (of your choice) to values (of your choice). (The integer is tricky - google or use help for this).

  4. Operators Use the variables above to find the results of the operators , +, -, /, %%, %%, %/%, <, >, ==, !=, <=, >=. Which variables work with which data classes?

In [ ]:
# Your code here
In [ ]:
# Your code here
In [ ]:
# Your code here

Basic Programming continued

Loops and Branching

Sometimes, actually, nearly all the time, we want to repeat an operation or series of operations over many instances of a variable. For example, suppose I have a list of numbers and I want to double them all. This is a very simple example, and the first way I will show you to do it is exactly the wrong way in R. But we need simple examples to learn, so here we go.

There are a couple of ways to make a loop, depending on whether or not you know how many iterations will be required when you write the code.

A for loop is a loop that repeats over something called an iterator. The simplest example would be a count, say I want to do something 6 times. Then a for loop may be used. Doubling a list of known length calls for a for loop. (I’m kinda lying here… will clarify this in a bit.)

A while loop is used when conditions during execution of the code determine when the loop is terminated. Suppose we have a list of coin toss outcomes, and we want to count how many flips it took to get 3 heads. We don’t know how many iterations the loop will take, but we do have a condition for stopping.

In [22]:
## for loop example.

v <- seq(from = 2, to = 20, by = 3)

print(v)


for (i in 1:length(v)) {
    v[i] <- 2*v[i]
}

print(v)
[1]  2  5  8 11 14 17 20
[1]  4 10 16 22 28 34 40

The i in the above for loop is called an iterator. At the first step, i is initialized to 1 (because the code tells it to start at 1 and end at the length of v, with the in 1:length(v) statement. Each time the code in the body is executed, it is incremented (in this case, sequentially, one at a time). Iterators can be much fancier though. Here are a few examples:

In [23]:
## for loop example

iterator_variable <- c(1,10,32,1.45)


for ( i in iterator_variable){
    print(i)
}
[1] 1
[1] 10
[1] 32
[1] 1.45
In [24]:
## for loop example

iterator_variable <- list(1,10,32,1.45,"the", "quick brown fox")


for ( i in iterator_variable){
    print(i)
}
[1] 1
[1] 10
[1] 32
[1] 1.45
[1] "the"
[1] "quick brown fox"
In [25]:
## while loop example

samples <- rbinom(100, 1, prob = 0.2)  # generate 100 Bernouli trials with success probability 0.2
i <- 0  #initialize index for samples vector


found_success <- 'FALSE'  #initialize the loop termination condition

while (found_success == 'FALSE'){
       i <- i+1
       if (samples[i] == 1)
           found_success <- 'TRUE'
}

print(samples)   # print all trials
print(i)         # print index of first success.
  [1] 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 1
 [75] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
[1] 4

The above code also contains an example of ‘branching’. Branching is when the code that executes depends upon the current state of the program. In this example, the command found_success <- 'TRUE' only executes when the current (ith) component of samples is a 1.

Examples

  1. Write a for loop to print the index of each success in samples.
  2. Write a while loop to find the index of the third success.

The Next Level - vectors, matrices, functions, etc.

Avoiding loops: Vectorization

In R, and in other scripting languages, it is usually best to avoid loops when possible. Historically, looping in interpreted languages was slow. Nowadays, the issue is more with programming style, readability and lowering the possibilty of errors.

We avoid loops by using what is called ‘vectorization’. In other words, we use functions that work on whole vectors at once, rather than one element at a time (of course, at the lowest level, the code is iterating over each element. It is just happening using very fast machine code.)

Many operations in R are vectorized. For example, we just used a for loop to double the elements of a vector. We could have done this another way:

In [26]:
## Vectorization example

v <- seq(from = 2, to = 20, by = 3)

print(v)

print (2*v)

[1]  2  5  8 11 14 17 20
[1]  4 10 16 22 28 34 40

R interprets the multiplication operator to do componentwise multiplication (because it’s the only thing that ‘makes sense’). Suppose we have two vectors of the same length. What does * do?

In [27]:
v <- seq(from = 2, to = 20, by = 3)

w <- seq(from = 1, to = 21, by = 3)

print(v)
print(w)

print(v*w)
[1]  2  5  8 11 14 17 20
[1]  1  4  7 10 13 16 19
[1]   2  20  56 110 182 272 380

Again, multiplication is componentwise. There is another way to multiply 2 vectors of the same size:

In [28]:
print(v %*% w)
     [,1]
[1,] 1022

This is called the ‘dot product’. The ith component of v is multiplied by the ith component of w, and these are added together for all i. Because linear algebra is beyond our scope, we won’t deal with this operation much.

R has vectorization built in to many functions. In practice, if you ever feel you need to code a loop, you should look do some googling and see if there is another way.

We’ll first talk a bit more about vectors, matrices and lists, and then we will cover some common R functions that take these as arguments.

Examples

  1. Create the vector (1,2,3,4,5,6) using the seq command in R.
  2. Add 2 to every element in the vector.
  3. Square every element of the original vector.

Vectors, lists, and matrices

Linear algebra is not a pre-requisite for this course, so for our purposes, we will think of vectors as lists of numbers and matrices as two-dimensional arrays of numbers. They are much, much more than that, but that material is very much beyond the scope of this course. We will also mention (for those for whom it is meaningful) that vectors can be multiplied (in various ways) and matrices and vectors are related objects, and we can also multiply matrices together as well as multiply vectors and matrices.

Creating vectors lists and matrices
  • The concatenate function in R builds vectors. We can also obtain vectors from other functions as we have seen in the start of these notes with the function seq
  • Matrices are built with the command matrix. They can also be constructed from vectors or returned from certain functions.
  • Lists are created with the list function.
In [47]:
## Example of vector creation

v <- c(1,2,3,4)
print(v)

w <- c("This", "is", "a", "character", "vector")
print(w)
[1] 1 2 3 4
[1] "This"      "is"        "a"         "character" "vector"
In [49]:
## Example of matrix creation

## First just create a vector of entries
matrix_entries <- c(1,2,3,4,5,6,7,8,9,10,11,12)

M <- matrix(matrix_entries, ncol = 3, nrow = 4)
print(M)

matrix_entries <- c("This", "is", "a", "character", "matrix", "with", "two", "rows")
N <-  matrix(matrix_entries, nrow = 2)
print(N)
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
     [,1]   [,2]        [,3]     [,4]
[1,] "This" "a"         "matrix" "two"
[2,] "is"   "character" "with"   "rows"
In [31]:
## We can create matrices of different dimensions

N <- matrix(matrix_entries, ncol = 2, nrow = 6)
print(N)
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12
In [32]:
## add rows or columns with row bind (``rbind``) or column bind (``cbind``)

## First create a vector of the appropriate length:

new_column <- rep(1,4)

P <- cbind(M, new_column)
print(P)
            new_column
[1,] 1 5  9          1
[2,] 2 6 10          1
[3,] 3 7 11          1
[4,] 4 8 12          1

Examples

  1. Create 3 vectors, one using concatenate, one using rep and one using seq
  2. Create 3 matrices: one 2x2, one 6x4, one a 4x4 with ones on the diagonal and zeroes everywhere else.
  3. Use cbind and rbind to add one row and one column to the 6x4 matrix.
In [33]:
# Your code here

Common Operations on Vectors and Matrices

Two major categories of vector operations are

  • Componentwise operations
  • Summary operations

We have already seen some componentwise operations (e.g. the doubling example). Let’s look at some summary operations and we will do some more componentwise in the examples.

In [34]:
## Sum of the components of a vector

v <- c(1,2,5,8,9,10)
sum(v)
35
In [35]:
## mean

mean(v)
5.83333333333333
In [36]:
## median

median(v)
6.5

Examples

  1. Create two vectors length 3 using the concatenate function. What happens when you add them using +?
  2. Create a third vector of length 2. What happens if you try to add that vector to one of the others?
  3. Create another vector of length 4. Add it to the vector of length 2. What do you observe?
  4. We have examples of sum, mean and median summary operations. Can you find some others?
  5. Try the functions sort and sample. What do they do? Try sample with and without replacement.
In [37]:
## Your code here

Functions

A function is code that behaves like a black box - you give it some input and it returns some output. Functions are essential because they allow you to design potentially complicated algorithms without having to keep track of the algorithmic details - all you need to remember are the function name (which you can search with apropos or by prepending ?? to the search term), the input arguments (TAB provides these), and what type of output will be generated. Most of R’s usefulness comes from the large collection of functions that it provides - for example, library(), seq(), as.fractions() are all functions. You can also write your own R functions - this will be covered in detail in a later session - but we show the basics here.

Creating a function

A function has the following structure

function_name <- function(arguments) {
    body_of_function
    return(result)
}

function_name is defined by the user (you). It can be anything you like, but meaningfulnames are encouraged to make code more readable to humans.

function is a keyword (one that has special meaning to R). It tells R that this is a function you are defining (as opposed to a variable).

arguments are inputs to the function. Usually, when you write a function, you want it to do something to a variable or variables.

body_of_function is a list of R commands. They are whatever you wish this function to do with the arguments.

return is a keyword. It tells R to send something back to the code that ‘called’ it.

result is user-defined. It is whatever you want R to send back to the code that called this function.

The simple example below will make this explicit.

In [43]:
MyFunc <- function(a, b) {
    c <- a * b
    d <- a + 2*b
    return(d)
}

function_name is MyFunc

arguments are a and b

body_of_function is

c <- a * b
d <- a + 2 * b

result is d

In [44]:
MyFunc(5, 4)
13
In [45]:
MyFunc(b = 4, a = 5)
13
In [46]:
MyFunc(4, 5)
14

Examples

  1. Write a function that takes as input principal, interest rate, compoundings per period and number of periods and returns current amount.
  2. Write a function that takes as input two points, x and y, and returns the distance between them.

Lists and Data Frames

A more general data structure in R is something called a list. Lists can have elements that are pretty much anything. For example, we can create a list of three members whose first element is a character vector, second element is a matrix and a third is a numeric vector. There is no restriction on length or type. Lists can even have other lists as elements.

This is important, because Bioconductor packages often make use of something called S4 objects. These are really just complicated lists (with functions that act on them).

Simple list

In [61]:
a <- c("a", "character", "vector")
b <- matrix(1:12, nrow = 4)
c <- 2:5

print(a)
print(b)
print(c)

my_list <- list(first = a, second = b, third = c)

print(my_list)
[1] "a"         "character" "vector"
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
[1] 2 3 4 5
$first
[1] "a"         "character" "vector"

$second
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

$third
[1] 2 3 4 5

Note hoe the elements are named. Writing “first = ” names the first element ‘first’. Now, we can access elements of the list in different ways:

In [62]:
my_list[1]
$first =
  1. 'a'
  2. 'character'
  3. 'vector'
In [63]:
my_list["second"]
$second =
1 5 9
2 6 10
3 7 11
4 8 12
In [64]:
my_list$third
  1. 2
  2. 3
  3. 4
  4. 5

The ‘$’ is probably the most often used, but there is a lot of old R code out there where lists are accessed by the element number (you would also use element number when looping over a list, but you should probably not be doing that anyway!)

Data frames

Data frames are lists where each element has the same number of columns. This is the data structure most often seen in R code, and many functions are specialized to work on them easily. We can create one from scratch:

In [65]:
names <- c("bast", "dash", "chestnut", "tibbs")
species <- c("cat", "cat", "dog", "dog")
ages <- c(9, 9, 7, 8)

df <- data.frame(name = names, species = species, age = ages)
df
namespeciesage
bast cat 9
dash cat 9
chestnutdog 7
tibbs dog 8

R also has many built-in data frames that can be used to experiment or for examples.

In [66]:
head(iris)

hist(iris$Petal.Length)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
../_images/janice_3_IntroductionToR_74_1.png

We can also create a data frame from a .csv file.

In [60]:
titanic_df <- read.csv("titanic.csv")
head(titanic_df)
Warning message in file(file, "rt"):
“cannot open file 'titanic.csv': No such file or directory”
Error in file(file, "rt"): cannot open the connection
Traceback:

1. read.csv("titanic.csv")
2. read.table(file = file, header = header, sep = sep, quote = quote,
 .     dec = dec, fill = fill, comment.char = comment.char, ...)
3. file(file, "rt")

Examples

  1. Explore the data frame mtcars. What variables does it contain? What are their data types?
  2. What is the average miles-per-gallon for a car with 4 cylinders? How about 8?
  3. Plot a histogram of the vehicle weight.
In [67]:

mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX421.0 6 160 110 3.90 2.62016.460 1 4 4
Mazda RX4 Wag21.0 6 160 110 3.90 2.87517.020 1 4 4
Datsun 71022.8 4 108 93 3.85 2.32018.611 1 4 1
Hornet 4 Drive21.4 6 258 110 3.08 3.21519.441 0 3 1
Hornet Sportabout18.7 8 360 175 3.15 3.44017.020 0 3 2
Valiant18.1 6 225 105 2.76 3.46020.221 0 3 1