In [1]:
library(tidyverse)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

Custom Functions in R

What is a function?

A function, in the programming sense, is a bit of code that is organized and named so that it may be executed by simply executing the name (referred to as ‘calling’ a function). Often, functions take some values as input and return output values.

You have already met functions in R. The command ‘sum’ is a function. It takes as arguments a list of objects and adds them:

In [2]:
sum(1,2,4)
7

The sum function takes a list of numbers as arguments (inputs), adds them together and returns the result.

While there are many, many built-in functions in R, it is often desirable to write your own. It’s also helpful to be familiar with the structure of functions, to better understand how function calls work.

Anatomy of a function

The R function has the following structure

name <- function(arg1, arg2, ...) { # name of function and arguments (inputs) it takes
    body_of_function                # code that (presumably) manipulates arguments
    return(value)                   # What the function returns (result of manipulations
    }

A function is created using the function keyword, followed by a series of arguments in parentheses. The main work done by the function is enclosed witin curly braces, and a return function is used to indicate the output of the function. Finally we can assign the funciton, just like any other R object, to a named variable for later use.

Our first custom function

Let’s write a function to calculate the mean of a vector of numbers.

In [1]:
MyMean <- function(xs) {
    n <- length(xs)
    return(sum(xs/n))
}
In [2]:
MyMean(1:10)
5.5

There are a lot details to writing custom functions, such as specifying default values, position of arguments, etc.

In [3]:
## Another function example - with default arguments

WeightAvg <- function(xs, weights = rep(1,length(xs))) {

    n <- length(xs)
    weighted_vector <- xs*weights
    return(sum(weighted_vector)/n)

}
In [4]:
xs <- c(1,2,3,4,5,6,7)

WeightAvg(xs)
4
In [8]:
weights <- c(2,3,2,3,2,3,1)

## Note that the order of arguments doesn't matter, because we have labeled the default.

WeightAvg(weights = weights, xs)
8.71428571428571
In [10]:
## Order matters when arguments aren't labeled.

ShowArgs <- function(xs, weights = rep(1,length(xs))) {

    print(weights)
    print(xs)

}

# Note that the above function does not return anything
In [11]:
ShowArgs(xs,weights)
[1] 2 3 2 3 2 3 1
[1] 1 2 3 4 5 6 7
In [12]:
ShowArgs(weights,xs)
[1] 1 2 3 4 5 6 7
[1] 2 3 2 3 2 3 1
In [13]:
ShowArgs(weights = weights,xs)
[1] 2 3 2 3 2 3 1
[1] 1 2 3 4 5 6 7

Examples

  1. Write a function MySd that takes a vector of numbers and returns their standard deviation as calculated from the following formula:

    \[ \begin{align}\begin{aligned} \sqrt{\frac{\sum{(x - \bar{x})^2}}{n-1}}\\where :math:`x` is some vector of numbers, :math:`\bar{x}` is the\end{aligned}\end{align} \]

    mean of \(x\) and \(n\) is th number of elements in \(x\).

    What is the standard deviation of 1:10?

  2. Write a function that takes a matrix and multiplies it by a vector (for those who don’t know about matrix multiplication, an in-class example will be provided). Provide the identity matrix as a default: diag(rep(1,length(v)).

Mapping with apply

R provides tools for performing vectorized operations with nearly any function. We will first talk about the *apply group, but the more modern version of these are part of the ‘tidyverse’ in the package ‘purr’. More on those later...

Suppose I have a matrix, and I would like to obtain a sum of each column, and store that in a vector. One way to do this is to simply loop over the columns using a for loop:

In [21]:
my_matrix <- matrix(rnorm(30), ncol = 6, nrow = 5)

sums <- rep(0,6)

for (i in 1:6){
    sums[i] <- sum(my_matrix[,i])
}

print(sums)
[1] -2.0673368  3.3092049 -1.3901258 -5.1382279  0.8350368 -0.7088264

But there is another way, that is better for various reasons:

In [23]:
sums <- apply(FUN = sum, X = my_matrix, MARGIN = 2)

print(sums)
[1] -2.0673368  3.3092049 -1.3901258 -5.1382279  0.8350368 -0.7088264

Note that we can easily change this to a row sum:

In [27]:
sums <- apply(FUN = sum, X = my_matrix, MARGIN = 1)

print(sums)
[1]  0.6588709 -1.5304585 -3.0984142  0.6073535 -1.7976270

There are a number of things that make this version better than the for loop. There is less code, with fewer variables that need to be specified.

  • In the for loop, we need to somehow include the dimension, so that the loop knows how many interations to perform.
  • In the apply version, We can easily convert the column sum into a row sum and we do not need to specify dimension in either case.
  • The apply code can be easily parallelized (this is beyond our scope, but something to keep in mind).
  • The less code we write, the less chance for bugs to crop up.

Note that apply works for user generated functions as well:

In [28]:
apply(MyMean, X = my_matrix, MARGIN = 1)
  1. 0.109811818585238
  2. -0.255076413886755
  3. -0.516402373949533
  4. 0.101225588404194
  5. -0.299604492013341
In [29]:
apply(MyMean, X = my_matrix, MARGIN = 2)
  1. -0.41346736204718
  2. 0.661840972673175
  3. -0.278025169001729
  4. -1.02764557513093
  5. 0.167007368691562
  6. -0.141765282617135

There are a number of variants of the apply function that differ on the data types they return or take as input. The lapply function works on lists (as opposed to 2 dimensional arrays such as matrices) and returns a list object. The sapply function is the same as lapply, but it attempts to simplify output to be the simplest possible class (e.g. a list of numerics is returned as a vector, rather than a list).

Examples

  1. Create a matrix using rnorm with 10 rows and 15 columns. Use apply to obtain the median of rows and then over the columns.
  2. Creat the following list: x <- list(a = rep(1,7), b = 1:3, c = 10:100)

Use lapply and sapply to get the length of each element of the list.

In [37]:
# Your code here

Next, we will explore dataframes and something new called a ‘tibble’. We’ll see how to manipulate these structures, and then we will talk about the new generation of mapping functions.