Basic R in the Jupyter Notebook and RStudio

The first thing you should do with this notebook is make a copy! Go to the file menu and choose ‘Make a Copy’.

Great! Now you should be working in a file called ‘BasicRinJupyterAndRstudio-Copy1’

Briefly go back to the other tab and choose ‘Close and Halt’ from the file menu. Why did we do this? Because we keep the lecture materials in something called a ‘GitHub repository’. We may need to make changes to the notebooks, and if and when we do, we will show you how to update your VM with the new materials. We want whatever changes you make to be saved under different file names, and this seems to be the easiest way. (Note: even if you don’t think you are making changes, the notebook ‘autosaves’, so your notebook will almost always be considered ‘changed’ as far as GitHub is concerned... anyway, just trust me!)

R is a programming environment created specifically for statistics. It is a scripting language (if you don’t know what that means, don’t worry for now). R can be used interactively (as we will see in this notebook), or it can be told to execute a list of commands stored in a plain text file (called a ‘script’).

From within the Jupyter notebook, we can access the R ‘kernel’ (the program that interprets R code and returns results). This is just one way to use R. We will also learn to use a program called RStudio.

The Notebook

As you can see, the notebook is browser based (it opens a window in your browser) and is delivered to you by a web server, much like a web page. This notebook is running an R kernel, but we could choose a Python kernel, a bash kernel (unix shell), or another one from a long and growing list:

https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages

Be careful, though. Much of this is under development and considered ‘beta’ (or even alpha) - the tools can be buggy. We will be careful to use only the better developed parts of the Jupyter universe.

The notebook is made up of ‘cells’. Cells can be either ‘code’ or ‘markdown’. Code cells are for writing R commands. Markdown cells are for text; markdown is a lightweight markup language that gets converted to html for display. This cell is a markdown cell.

More about markdown here:

https://en.wikipedia.org/wiki/Markdown

You can type anything you want in markdown (though there are some special characters that will be interpreted as commands).

Code cells require R syntax. The following cell is an R cell:

In [2]:
# This is an R cell. The '#' tells R this is a comment
3+1
Out[2]:
4

When I run the code cell, it executes the R code (3+1) and returns the output in an output cell. To run a code cell, you can type shift-enter, or press the run button at the top of the screen.

R Basics - Data Types

In any programming language, we have the notion of ‘data modes’ and ‘data structures’. This is because programs manipulate data, and different kinds of data require different manipulations. For example, numbers are treated differently than characters (or strings of characters) and single numbers are treated differently than lists of numbers (vectors) or arrays of numbers (matrices).

The following are some simple R data modes:

  • numeric
  • character
  • logical (TRUE or FALSE)
  • complex (we won’t worry about these!)
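If you want to check the mode of a value yourself, here is a minimal sketch. The variable names are made up for illustration; `mode()` is a base R function, and for these four values `class()` (used in the examples further down) reports the same names.

```r
# One value of each basic mode (variable names are illustrative only)
x_num <- 8.4        # numeric
x_chr <- "hello"    # character
x_lgl <- TRUE       # logical
x_cpx <- 1 + 2i     # complex

# mode() reports how R stores each value
mode(x_num)
mode(x_chr)
mode(x_lgl)
mode(x_cpx)
```

Try swapping in your own values and see what R reports.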

Modes can be combined to form data structures:

  • Vectors
  • Matrices
  • Strings
  • Data Frames

Examples

In [47]:
class(c(1,3,2,8.4)) #This is a vector
Out[47]:
'numeric'
In [48]:
class(matrix(c(1,3,2,4,5,6),nrow=2,ncol=3)) #This is a matrix
Out[48]:
'matrix'
In [54]:
"This is a string!"
Out[54]:
'This is a string!'
In [58]:
data.frame(c("This is a string","This is another string"),matrix(c(1:6),nrow=2,ncol=3))
Out[58]:
  c..This.is.a.string....This.is.another.string.. X1 X2 X3
1                                This is a string  1  3  5
2                          This is another string  2  4  6

The important thing to note above is the combination of both character and numeric data into one object! That is what is special about data frames.
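The automatically generated column name above is a bit of a mouthful. As a rough sketch, here is one way to build a similar data frame with column names of our own choosing (the names `label`, `a` and `b` are made up for this example) and then ask R for the class of each column:

```r
# A small data frame mixing character and numeric columns.
# The column names are invented for illustration.
df <- data.frame(
  label = c("This is a string", "This is another string"),
  a = c(1, 2),
  b = c(3, 4),
  stringsAsFactors = FALSE  # keep text as character in older R versions too
)

df

# Class of each column: one character column, two numeric columns
sapply(df, class)
```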

R Basics - Creating Objects

c - Concatenate

We have just seen this command in action. The ‘c’ command combines objects by concatenation. For example:

In [30]:
c(5,6,7)
Out[30]:
  1. 5
  2. 6
  3. 7

creates a vector of length 3. We can append to that vector, like so:

In [32]:
c(c(5,6,7),8)
Out[32]:
  1. 5
  2. 6
  3. 7
  4. 8

Of course, we would usually have named the first vector something:

In [50]:
v1<-c(5,6,7)
v2<-c(v1,8)
print(v1)
print(v2)
[1] 5 6 7
[1] 5 6 7 8

rbind

Now, if we would like to create a matrix, we could use the matrix command as above:

In [51]:
matrix(c(1,3,2,4,5,6),nrow=2,ncol=3)
Out[51]:
1 2 5
3 4 6

Or, we could create two vectors and combine them:

In [43]:
v1<-c(1,2,5)
v2<-c(3,4,6)
m1<-rbind(v1,v2)
m1
class(m1)
Out[43]:
v1 1 2 5
v2 3 4 6
Out[43]:
'matrix'

Notice that R has automatically assigned row names for us. Thank you, R! We can also use the column-based version, cbind (‘column bind’, just as rbind means ‘row bind’), to append a column to a matrix:

cbind

In [6]:
m1<-matrix(c(1,2,3,4),nrow=2,ncol=2)
m1
Out[6]:
1 3
2 4
In [1]:
m2<-cbind(m1,c(5,6))
m2
Out[1]:
1 3 5
2 4 6
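Here is a self-contained sketch of both binding functions. It defines its inputs in the same cell, so it does not depend on earlier cells having been run in the current session. The names `m2` and `m3` and the appended values are only for illustration.

```r
# Start from a 2 x 2 matrix; matrix() fills column by column
m1 <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

m2 <- cbind(m1, c(5, 6))   # append a column
m3 <- rbind(m1, c(7, 8))   # append a row

dim(m2)   # number of rows, then number of columns
dim(m3)
```

Checking `dim()` after each step is a quick way to confirm that the matrix grew in the direction you intended.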

Getting Help in the Notebook

If you know the name of the command you want to use, you can just type ‘? command_name’ in a code cell and run it, like so:

In [2]:
?help
Out[2]:
help {utils}                                        R Documentation

Documentation

Description

help is the primary interface to the help systems.

Usage

help(topic, package = NULL, lib.loc = NULL,
     verbose = getOption("verbose"),
     try.all.packages = getOption("help.try.all.packages"),
     help_type = getOption("help_type"))

Arguments

topic

usually, a name or character string specifying the topic for which help is sought. A character string (enclosed in explicit single or double quotes) is always taken as naming a topic.

If the value of topic is a length-one character vector the topic is taken to be the value of the only element. Otherwise topic must be a name or a reserved word (if syntactically valid) or character string.

See ‘Details’ for what happens if this is omitted.

package

a name or character vector giving the packages to look into for documentation, or NULL. By default, all packages whose namespaces are loaded are used. To avoid a name being deparsed use e.g. (pkg_ref) (see the examples).

lib.loc

a character vector of directory names of R libraries, or NULL. The default value of NULL corresponds to all libraries currently known. If the default is used, the loaded packages are searched before the libraries. This is not used for HTML help (see ‘Details’).

verbose

logical; if TRUE, the file name is reported.

try.all.packages

logical; see Note.

help_type

character string: the type of help required. Possible values are "text", "html" and "pdf". Case is ignored, and partial matching is allowed.

Details

The following types of help are available:

  • Plain text help

  • HTML help pages with hyperlinks to other topics, shown in a browser by browseURL. (Where possible an existing browser window is re-used: the OS X GUI uses its own browser window.) If for some reason HTML help is unavailable (see startDynamicHelp), plain text help will be used instead.

  • For help only, typeset as PDF – see the section on ‘Offline help’.

The ‘factory-fresh’ default is text help except from the OS X GUI, which uses HTML help displayed in its own browser window.

The rendering of text help will use directional quotes in suitable locales (UTF-8 and single-byte Windows locales): sometimes the fonts used do not support these quotes so this can be turned off by setting options(useFancyQuotes = FALSE).

topic is not optional: if it is omitted R will give

  • If a package is specified, (text or, in interactive use only, HTML) information on the package, including hints/links to suitable help topics.

  • If lib.loc only is specified, a (text) list of available packages.

  • Help on help itself if none of the first three arguments is specified.

Some topics need to be quoted (by backticks) or given as a character string. These include those which cannot syntactically appear on their own such as unary and binary operators, function and control-flow reserved words (including if, else for, in, repeat, while, break and next). The other reserved words can be used as if they were names, for example TRUE, NA and Inf.

If multiple help files matching topic are found, in interactive use a menu is presented for the user to choose one: in batch use the first on the search path is used. (For HTML help the menu will be an HTML page, otherwise a graphical menu if possible if getOption("menu.graphics") is true, the default.)

Note that HTML help does not make use of lib.loc: it will always look first in the loaded packages and then along .libPaths().

Offline help

Typeset documentation is produced by running the LaTeX version of the help page through pdflatex: this will produce a PDF file.

The appearance of the output can be customized through a file ‘Rhelp.cfg’ somewhere in your LaTeX search path: this will be input as a LaTeX style file after Rd.sty. Some environment variables are consulted, notably R_PAPERSIZE (via getOption("papersize")) and R_RD4PDF (see ‘Making manuals’ in the ‘R Installation and Administration Manual’).

If there is a function offline_help_helper in the workspace or further down the search path it is used to do the typesetting, otherwise the function of that name in the utils namespace (to which the first paragraph applies). It should accept at least two arguments, the name of the LaTeX file to be typeset and the type (which is nowadays ignored). It accepts a third argument, texinputs, which will give the graphics path when the help document contains figures, and will otherwise not be supplied.

Note

Unless lib.loc is specified explicitly, the loaded packages are searched before those in the specified libraries. This ensures that if a library is loaded from a library not in the known library trees, then the help from the loaded library is used. If lib.loc is specified explicitly, the loaded packages are not searched.

If this search fails and argument try.all.packages is TRUE and neither packages nor lib.loc is specified, then all the packages in the known library trees are searched for help on topic and a list of (any) packages where help may be found is displayed (with hyperlinks for help_type = "html"). NB: searching all packages can be slow, especially the first time (caching of files by the OS can expedite subsequent searches dramatically).

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

? for shortcuts to help topics.

help.search() or ?? for finding help pages on a vague topic; help.start() which opens the HTML version of the R help pages; library() for listing available packages and the help objects they contain; data() for listing available data sets; methods().

Use prompt() to get a prototype for writing help pages of your own package.

Examples

help()
help(help)              # the same

help(lapply)

help("for")             # or ?"for", but quotes/backticks are needed

try({# requires working TeX installation:
 help(dgamma, help_type = "pdf")
 ## -> nicely formatted pdf -- including math formula -- for help(dgamma):
 system2(getOption("pdfviewer"), "dgamma.pdf", wait = FALSE)
})

help(package = "splines") # get help even when package is not loaded

topi <- "women"
help(topi)

try(help("bs", try.all.packages = FALSE)) # reports not found (an error)
help("bs", try.all.packages = TRUE)       # reports can be found
                                          # in package 'splines'

## For programmatic use:
topic <- "family"; pkg_ref <- "stats"
help((topic), (pkg_ref))

[Package utils version 3.2.0 ]
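The ‘See Also’ section above mentions help.search() for when you only have a vague idea of the topic. A minimal sketch of that, plus `apropos()` (another utils function, not mentioned in the help page above), which lists object names matching a pattern. The search terms used here are arbitrary examples.

```r
# Search the installed help pages for a keyword.
# Typing ??csv in a code cell does the same search.
hits <- help.search("csv")

# List objects on the search path whose names start with "read"
read_functions <- apropos("^read")
read_functions
```

Once you spot a promising name in the list, `?name` takes you to its help page.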

Don’t forget Google

If you want to do something, but don’t know the command in R, Google can be a great tool! Go ahead and google how to create a histogram in R.

Play!

Sure, we know we need to work to learn new things - but let’s not underestimate the power of play! Take a few moments to play in this new sandbox. Create some vectors, matrices, strings, etc. What can you do? Can you figure out how to make R multiply a matrix times (an appropriately sized) vector? Multiply two matrices? What happens if you add two vectors? Multiply? Do you get the answer you expect?

In [1]:
# Start here - you can work within the lecture notes! How cool is that?

Now, the notebook is a great environment, especially for doing reproducible research, documenting all your steps and keeping track of things during exploratory analysis. There is another R interface available, and it is much more ‘mature’ than the Jupyter project (mature does not mean ‘better’ - just that some features available in R and RStudio may not yet be incorporated in the Jupyter R kernel).

RStudio

Your VM has been set up as an RStudio ‘server’. This means that you can connect to it in a similar manner as you did to the notebook server. Use the following URL:

http://colab-sbx-XXX.oit.duke.edu:8787

RStudio Server uses the system login authentication, so type in your VM username (bitnami) and the password you were assigned by Duke’s OIT.

Your window should look like so:

In the upper left corner, we have a ‘script’ window. This is where you type code that you would like to save into a file. In the lower left is a console window. This is where you type code to execute. It is just like the command line in unix. You type code, press enter, and it gets executed.

On the right hand side is a window with tabs for Files, Plots, Packages, Help and Viewer. As we are beginners, the ‘help’ tab will be the most relevant. Here, we can find information on syntax, what functions do what, tutorials, etc.

Work!

  • Plot a histogram of 100 numbers generated from the standard normal distribution. Hint: Use the ‘Search Engine’ feature under ‘Help’ to find out how to generate random numbers from the standard normal, then search for ‘histogram’. Don’t use the ggplot histogram. We’ll cover graphics grammar later on!
  • Use the script window to compute the mean, variance and median of the following list: (1,2,5,5,2,5,6,8,1,10) and save the script under the name ‘Example1.R’
  • Do the same below, using Jupyter! (Hint: Go to the ‘home’ screen for Jupyter and click on the ‘new’ button on the upper right of the screen. Choose ‘text’ file and enter your commands. Save the file under the appropriate name. Use the ‘source’ command to run the code in an R cell.)
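If you have not used the ‘source’ command from the last hint before, here is a minimal sketch of how it works. It writes a throwaway two-line script to a temporary file (the file contents and the names `x` and `total` are made up, and are deliberately not the exercise) and then runs it.

```r
# Write a tiny script to a temporary file; tempfile() generates the path
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 6)", "total <- sum(x)"), script_path)

# source() runs every line of the file in the current session,
# so objects created by the script are available afterwards
source(script_path)
total
```

For the exercise, you would pass the name of the file you saved instead of `script_path`.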