{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning Continued - What Could Go Wrong?" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "In this lab, we demonstrate some common pitfalls in supervised learning analyses. To this end, we simulate data under the null (that is, with no relationship between the outcome and the independent variables) and observe situations in which we may nonetheless commit a type I error." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## Supervised\n", "\n", "# Simulate noisy data matrix (EXPRS)\n", "set.seed(123)\n", "# We'll use 2 groups of 20 subjects - think 20 cases and 20 controls\n", "n=20\n", "# Simulate 1000 genes\n", "m=1000\n", "\n", "# Randomly generate a 2n x m matrix of 'expression levels' -- any continuous variable we may be interested in\n", "EXPRS=matrix(rnorm(2*n*m),2*n,m)\n", "\n", "# Just naming rows and columns\n", "rownames(EXPRS)=paste(\"patient\",1:(2*n),sep=\"\")\n", "colnames(EXPRS)=paste(\"gene exp\",1:m,sep=\"\")\n", "\n", "# The group labels are assigned arbitrarily: the first n patients get 0 and the last n get 1.\n", "# Because EXPRS is pure noise, case/control status carries no information about gene expression\n", "grp=rep(0:1,c(n,n))\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "Attaching package: ‘genefilter’\n", "\n", "The following object is masked from ‘package:base’:\n", "\n", " anyNA\n", "\n" ] }, { "data": { "text/html": [ "
|  | statistic | dm | p.value |
|---|---|---|---|
| gene exp1 | 0.6746243 | 0.192881 | 0.5039985 |
| gene exp2 | 0.7417175 | 0.2264023 | 0.4628184 |
| gene exp3 | 3.025423 | 0.7344752 | 0.004436959 |
| gene exp4 | 0.4030939 | 0.1349871 | 0.6891382 |
| gene exp5 | 0.9545301 | 0.3004477 | 0.3458485 |
| gene exp6 | 0.3305064 | 0.09782354 | 0.7428327 |