Using R for supervised learning¶
This notebook goes over the basic concepts of how to construct and use a supervised learning pipeline for classification. We will use the k-nearest neighbors algorithm for illustration, but the basic ideas carry over to all algorithms for classification and regression.
In [1]:
healthdy <- read.table('healthdy.txt', header = TRUE)
In [2]:
head(healthdy)
Out[2]:
| | ID | GENDER | FLEXPRE | FLEXPOS | BAWPRE | BAWPOS | BWWPRE | BWWPOS | BFPPRE | BFPPOS | FVCPRE | FVPOS | METSPRE | METSPOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 21.000 | 21.500 | 70.5 | 75.6 | 3.3 | 3.7 | 14.58 | 14.17 | 5.1 | 5.1 | 12.7 | 18.0 |
| 2 | 2 | 1 | 21.000 | 21.250 | 71.3 | 70.7 | 3.2 | 3.6 | 16.79 | 13.95 | 4.3 | 4.3 | 11.1 | 12.0 |
| 3 | 3 | 1 | 21.500 | 20.000 | 64.5 | 66.6 | 4.1 | 4.0 | 6.60 | 8.98 | 4.5 | 4.5 | 15.3 | 16.7 |
| 4 | 4 | 1 | 23.000 | 23.375 | 97.0 | 95.0 | 4.4 | 4.3 | 18.04 | 17.32 | 4.7 | 4.3 | 12.0 | 17.5 |
| 5 | 5 | 1 | 21.000 | 21.000 | 71.0 | 73.2 | 3.7 | 3.8 | 11.12 | 11.50 | 5.8 | 5.8 | 12.2 | 12.2 |
| 6 | 6 | 1 | 20.500 | 20.750 | 72.5 | 73.1 | 3.1 | 3.4 | 17.88 | 16.22 | 4.3 | 4.3 | 11.1 | 10.0 |
Supervised learning problem¶
For simplicity and ease of visualization, we will just use the first two independent variables as features for predicting gender. In practice, the selection of appropriate features to use as predictors can be a challenging problem that greatly affects the effectiveness of supervised learning.
So the problem is: how accurately can we guess the gender of a student from the FLEXPRE and BAWPRE variables?
Visualizing the data¶
First let’s make a smaller dataframe containing just the variables of interest, and make some plots.
In [3]:
df <- healthdy[,c("ID", "GENDER", "FLEXPRE", "BAWPRE")]
df$ID <- factor(df$ID)
df$GENDER <- factor(df$GENDER, labels = c("Male", "Female"))
df$FLEXPRE <- as.numeric(df$FLEXPRE)
In [4]:
summary(df)
Out[4]:
ID GENDER FLEXPRE BAWPRE
0 : 2 Male : 82 Min. : 1.00 Min. :35.20
2 : 2 Female:100 1st Qu.:26.00 1st Qu.:57.73
3 : 2 Median :42.00 Median :65.05
4 : 2 Mean :38.76 Mean :66.99
5 : 2 3rd Qu.:52.00 3rd Qu.:74.50
6 : 2 Max. :67.00 Max. :98.50
(Other):170
Let’s check the mean flexibility and weight for boys and girls.
In [5]:
with(df, aggregate(df[,3:4], by=list(Gender=GENDER), FUN=mean))
Out[5]:
| | Gender | FLEXPRE | BAWPRE |
|---|---|---|---|
| 1 | Male | 33.46341 | 75.89024 |
| 2 | Female | 43.1 | 59.695 |
On average, girls are more flexible and weigh less than boys. This is confirmed visually.
In [6]:
plot(df$FLEXPRE, df$BAWPRE, col=df$GENDER,
xlab="Flexibility", ylab="Weight",
main="Flexibility and Weight grouped by Gender")
legend(0, 100, c("Male", "Female"), pch=1, col=1:2)
Comments¶
It looks like there is a pretty good probability that we can guess the gender from body weight and flexibility alone. The k-nearest neighbors algorithm does this guessing in a very simple fashion: given any point in the data set, it looks for the nearest k neighboring points, and simply uses the majority gender among these neighbors as the guess. In the sections below, we’ll implement a supervised learning pipeline using k-nearest neighbors.
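The majority-vote idea can be sketched in a few lines of base R. This is a toy illustration with made-up points and labels, not the implementation used by `class::knn` below:

```r
# Toy majority-vote kNN classification of one query point (made-up data, k = 3)
pts <- matrix(c(30, 75,
                50, 58,
                45, 60,
                25, 80), ncol = 2, byrow = TRUE)   # (flexibility, weight) pairs
grp <- factor(c("Male", "Female", "Female", "Male"))
query <- c(48, 59)                                  # the point to classify

# Euclidean distance from the query to every training point
d <- sqrt(rowSums((pts - matrix(query, nrow(pts), 2, byrow = TRUE))^2))
nearest <- order(d)[1:3]                            # indices of the 3 closest points
guess <- names(which.max(table(grp[nearest])))      # majority vote among neighbors
guess                                               # "Female"
```

Two of the three nearest neighbors are labeled Female, so the majority vote classifies the query point as Female.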
Work!¶
Review questions to make sure you are up to speed with basic data manipulation and plotting.
Q1. Tabulate the median value of FLEXPRE and BAWPRE by gender.
In [ ]:
Q2. Tabulate the average change in weight from the beginning to the end of the semester by gender.
In [ ]:
Q3. Identify from the plot above the IDs of 3 individuals for whom you expect k-nearest neighbors to make the wrong gender prediction. Hint: Make a scatterplot but add the IDs as labels for each point, using a small x-offset of 2.5 so that labels are immediately to the right of each point.
In [ ]:
Splitting data into training and test data sets¶
We will use 3/4 of the data to train the algorithm and 1/4 to test how good it is. The reason for doing this is that if we train on the full data set, the algorithm has “seen” its test points before, and hence will seem more accurate than it really is with respect to new data samples. “Holding out” some of the data for testing, so that it is not used to train the algorithm, allows us to honestly evaluate the “out of sample” error.
In [7]:
set.seed(123) # set random number seed for reproducibility
size <- floor(0.75 * nrow(df)) # desired size of training set
df <- df[sample(nrow(df), replace = FALSE),] # shuffle rows randomly
df.train <- df[1:size, ] # take first size rows of shuffled data frame as training set
df.test <- df[(size+1):nrow(df), ] # take the remaining rows as the test set
x.train <- df.train[,c("FLEXPRE", "BAWPRE")]
y.train <- df.train[,"GENDER"]
x.test <- df.test[,c("FLEXPRE", "BAWPRE")]
y.test <- df.test[,"GENDER"]
In [8]:
summary(df.train)
Out[8]:
ID GENDER FLEXPRE BAWPRE
7 : 2 Male :61 Min. : 1.00 Min. :35.20
8 : 2 Female:75 1st Qu.:24.00 1st Qu.:57.70
9 : 2 Median :42.00 Median :65.40
10 : 2 Mean :38.49 Mean :67.34
11 : 2 3rd Qu.:52.00 3rd Qu.:74.88
12 : 2 Max. :67.00 Max. :98.50
(Other):124
In [9]:
summary(df.test)
Out[9]:
ID GENDER FLEXPRE BAWPRE
48 : 2 Male :21 Min. : 2.00 Min. :41.90
49 : 2 Female:25 1st Qu.:31.25 1st Qu.:58.10
51 : 2 Median :42.00 Median :64.60
81 : 2 Mean :39.54 Mean :65.97
0 : 1 3rd Qu.:51.00 3rd Qu.:73.15
2 : 1 Max. :65.00 Max. :90.20
(Other):36
Train knn on training set¶
In [10]:
library(class)
y.pred <- knn(x.train, x.test, cl=y.train, k=3)
In [11]:
y.pred
Out[11]:
- Male
- Female
- Male
- Female
- Male
- Female
- Female
- Female
- Female
- Female
- Male
- Female
- Female
- Female
- Male
- Female
- Male
- Male
- Female
- Female
- Female
- Male
- Male
- Female
- Male
- Female
- Female
- Female
- Female
- Male
- Female
- Male
- Female
- Male
- Female
- Male
- Female
- Female
- Male
- Male
- Female
- Female
- Male
- Male
- Male
- Male
Evaluate the model¶
In [12]:
table(y.pred, y.test)
Out[12]:
y.test
y.pred Male Female
Male 16 4
Female 5 21
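From a confusion matrix like the one above, overall test accuracy is simply the fraction of predictions on the diagonal. A minimal sketch, rebuilding the matrix from the counts printed above:

```r
# Rebuild the confusion matrix printed above (rows = predicted, columns = true)
conf <- matrix(c(16, 5, 4, 21), nrow = 2,
               dimnames = list(y.pred = c("Male", "Female"),
                               y.test = c("Male", "Female")))
accuracy <- sum(diag(conf)) / sum(conf)   # (16 + 21) / 46
round(accuracy, 2)                        # 0.8
```

So about 80% of the test-set genders were guessed correctly; the same idea could be applied to `table(y.pred, y.test)` directly.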
Who was predicted wrongly?¶
In [13]:
misses <- y.pred != y.test
In [14]:
df.test[misses,]
Out[14]:
| | ID | GENDER | FLEXPRE | BAWPRE |
|---|---|---|---|---|
| 56 | 56 | Male | 26 | 63.7 |
| 3 | 3 | Male | 46 | 64.5 |
| 160 | 77 | Female | 24 | 73.3 |
| 73 | 73 | Male | 53 | 64.7 |
| 15 | 15 | Male | 44 | 72.7 |
| 51 | 51 | Male | 56 | 61.5 |
| 120 | 37 | Female | 56 | 69.9 |
| 100 | 17 | Female | 42 | 67.3 |
| 99 | 16 | Female | 7 | 64.8 |
In [15]:
plot(df.test$FLEXPRE, df.test$BAWPRE, col=df.test$GENDER,
xlab="Flexibility", ylab="Weight",
main="Flexibility and Weight grouped by Gender")
points(df.test$FLEXPRE[misses], df.test$BAWPRE[misses], col="blue", cex=2)
legend(0, 100, c("Male", "Female"), pch=1, col=1:2)
Work!¶
Repeat the analysis using the BFPPRE and FVCPRE variables as predictors instead. Do you get better or worse predictions?
Q1. Extract the relevant variables into a new data frame.
In [ ]:
Q2. Visualize BFPPRE and FVCPRE grouped by gender.
In [ ]:
Q3. Split the data into training and test data sets using a 2/3, 1/3 ratio.
In [ ]:
Q4. Find the predictions made by knn with 5 neighbors.
In [ ]:
Q5. Make a table of true positives, false positives, true negatives and false negatives. Calculate
- accuracy
- sensitivity
- specificity
- positive predictive value
- negative predictive value
- F-score (harmonic mean of sensitivity and positive predictive value)
Look up definitions in Wikipedia if you don’t know what these mean.
In [ ]:
Q6. Make a plot to identify mis-classified subjects, if any.
In [ ]:
Cross-validation¶
Splitting into training and test data sets is all well and good, but when the amount of data we have is small, it is wasteful to have 1/4 or 1/3 held out as test data that is not used to train the algorithm. An alternative is to perform cross-validation, in which we split the data into k equal groups and cycle through all k combinations of k-1 training groups and 1 test group. For example, if we split the data into 4 groups (“4-fold cross-validation”), we would do
- Train on 1,2,3 and Test on 4
- Train on 1,2,4 and Test on 3
- Train on 1,3,4 and Test on 2
- Train on 2,3,4 and Test on 1
then finally combine the test results to evaluate the algorithm’s performance. The limiting case where we split into n groups (where n is the number of data points) and test on only 1 data point each time is known as Leave-One-Out Cross-Validation (LOOCV).
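One simple way to assign each row to a fold in base R is to shuffle a repeated sequence of fold labels. A sketch, assuming k = 4 folds and the 182 rows of this data set:

```r
set.seed(123)                              # for reproducibility
k <- 4
n <- 182                                   # nrow(df) for this data set
folds <- sample(rep(1:k, length.out = n))  # random fold label for each row
table(folds)                               # group sizes are as equal as possible
# On fold i's turn: train on df[folds != i, ], test on df[folds == i, ]
```

`rep(1:k, length.out = n)` guarantees the fold sizes differ by at most one, and `sample` shuffles the labels so each fold is a random subset of the rows.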
LOOCV¶
We will use a simple (inefficient) loop version of the algorithm that should be quite easy to understand.
First, we will recreate the data set in case you overwrote the variables in the exercises.
In [16]:
df <- healthdy[,c("ID", "GENDER", "FLEXPRE", "BAWPRE")]
df$ID <- factor(df$ID)
df$GENDER <- factor(df$GENDER, labels = c("Male", "Female"))
df$FLEXPRE <- as.numeric(df$FLEXPRE)
In [17]:
summary(df)
Out[17]:
ID GENDER FLEXPRE BAWPRE
0 : 2 Male : 82 Min. : 1.00 Min. :35.20
2 : 2 Female:100 1st Qu.:26.00 1st Qu.:57.73
3 : 2 Median :42.00 Median :65.05
4 : 2 Mean :38.76 Mean :66.99
5 : 2 3rd Qu.:52.00 3rd Qu.:74.50
6 : 2 Max. :67.00 Max. :98.50
(Other):170
In [18]:
y.test <- df[,"GENDER"]
y.pred <- y.test # we will overwrite the entries in the loop
for (i in 1:nrow(df)) {
x.test <- df[i, c("FLEXPRE", "BAWPRE")]
x.train <- df[-i, c("FLEXPRE", "BAWPRE")] # the minus sign means keep all rows except i
y.train <- df[-i, "GENDER"]
y.pred[i] <- knn(x.train, x.test, cl=y.train, k=3)
}
In [19]:
table(y.pred, y.test)
Out[19]:
y.test
y.pred Male Female
Male 62 18
Female 20 82
In [20]:
misses <- y.test != y.pred
plot(df$FLEXPRE, df$BAWPRE, col=df$GENDER,
xlab="Flexibility", ylab="Weight",
main="Flexibility and Weight grouped by Gender")
points(df$FLEXPRE[misses], df$BAWPRE[misses], col="blue", cex=2)
legend(0, 100, c("Male", "Female"), pch=1, col=1:2)
Work!¶
Q1. Increase the number of neighbors to 7. Does it improve the results? What are the tradeoffs of increasing or decreasing the number of neighbors?
In [ ]:
Q2. Implement 5-fold cross-validation for the FLEXPRE and BAWPRE variables. Tabulate the hits and misses and make a plot as in the previous examples.
In [ ]: