Supervised Learning I: classification
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can I apply supervised learning to a data set?
Objectives
Know the basic Machine Learning terminology.
Build a model to predict a categorical target variable.
Apply logistic regression and random forests algorithms to a data set and compare them.
Learn the importance of separating data into training and test sets.
As mentioned in passing before, supervised learning is the branch of machine learning that involves predicting labels, such as whether a tumour is benign or malignant.
In this section, you’ll attempt to predict tumour diagnosis based on geometrical measurements.
Discussion
Look at your pair plot. What would a baseline model be there?
Exercise
Build a model that predicts diagnosis based on whether X3 > 15, or something similar.
Solution
# Build baseline model
df$pred <- ifelse(df$X3 > 15, "M", "B")
df$pred
This is not a great model but it does give us a baseline: any model that we build later needs to perform better than this one.
Whoa: what do we mean by model performance here? There are many metrics for assessing model performance; here we'll use accuracy, the percentage of data points that the model classifies correctly.
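To make this concrete, accuracy can be computed directly by comparing the predictions with the true labels. A minimal sketch, assuming df$pred from the exercise above; the result should match the Accuracy reported by confusionMatrix() below:
# Accuracy by hand: proportion of predictions that match the true labels
mean(df$pred == df$X2)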
Note on terminology
- The target variable is the one you are trying to predict;
- Other variables are known as features (or predictor variables).
We first need to change df$X2, the target variable, to a factor:
# What is the class of X2?
class(df$X2)
[1] "character"
# Change it to a factor
df$X2 <- as.factor(df$X2)
# What is the class of X2 now?
class(df$X2)
[1] "factor"
Calculate baseline model accuracy:
# Calculate accuracy
confusionMatrix(as.factor(df$pred), df$X2)
Confusion Matrix and Statistics
Reference
Prediction B M
B 345 51
M 12 161
Accuracy : 0.8893
95% CI : (0.8606, 0.9139)
No Information Rate : 0.6274
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.754
Mcnemar's Test P-Value : 1.688e-06
Sensitivity : 0.9664
Specificity : 0.7594
Pos Pred Value : 0.8712
Neg Pred Value : 0.9306
Prevalence : 0.6274
Detection Rate : 0.6063
Detection Prevalence : 0.6960
Balanced Accuracy : 0.8629
'Positive' Class : B
Now it’s time to build an ever so slightly more complex model, a logistic regression.
Logistic regression
Let's build a logistic regression. You can read more about how logistic regression works here, and the instructor may show you some motivating and/or explanatory equations on the white/chalk-board. What's important to know is that logistic regression is used for classification problems (such as our case of predicting whether a tumour is benign or malignant).
Note on logistic regression
Logistic regression, or logreg, outputs a probability, which you’ll then convert to a prediction.
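For intuition: the model produces a linear score for each observation, the logistic (sigmoid) function squashes that score into a probability between 0 and 1, and the probability is then thresholded to give a class label. A minimal sketch (the sigmoid() helper and the 0.5 threshold are illustrative, not part of the lesson's code):
# The logistic (sigmoid) function maps any real-valued score to (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))
# Convert a few example scores to probabilities, then to class labels
scores <- c(-2, 0, 3)
probs <- sigmoid(scores)
probs
ifelse(probs > 0.5, "M", "B")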
Now build that logreg model:
# Build model
model <- glm(X2 ~ ., family = "binomial", df)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict probability on the same dataset
p <- predict(model, df, type="response")
# Convert probability to prediction "M" or "B"
pred <- ifelse(p > 0.50, "M", "B")
# Create confusion matrix
confusionMatrix(as.factor(pred), df$X2)
Confusion Matrix and Statistics
Reference
Prediction B M
B 357 0
M 0 212
Accuracy : 1
95% CI : (0.9935, 1)
No Information Rate : 0.6274
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.0000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 1.0000
Prevalence : 0.6274
Detection Rate : 0.6274
Detection Prevalence : 0.6274
Balanced Accuracy : 1.0000
'Positive' Class : B
Discussion
From the above, can you say what the model accuracy is?
Also, don’t worry about the warnings. See here for why.
BUT this is the accuracy on the data that you trained the model on, which is not necessarily indicative of how the model will generalize to data it has never seen before; and generalizing to new data is the whole purpose of building such models. For this reason, it is common to use a train/test split: train the model on a subset of your data and then compute the accuracy on the held-out test set.
# Set seed for reproducible results
set.seed(42)
# Train test split
inTraining <- createDataPartition(df$X2, p = .75, list=FALSE)
# Create train set
df_train <- df[ inTraining,]
# Create test set
df_test <- df[-inTraining,]
# Fit model to train set
model <- glm(X2 ~ ., family="binomial", df_train)
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Predict on test set
p <- predict(model, df_test, type="response")
pred <- ifelse(p > 0.50, "M", "B")
# Create confusion matrix
confusionMatrix(as.factor(pred), df_test$X2)
Confusion Matrix and Statistics
Reference
Prediction B M
B 88 6
M 1 47
Accuracy : 0.9507
95% CI : (0.9011, 0.98)
No Information Rate : 0.6268
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8926
Mcnemar's Test P-Value : 0.1306
Sensitivity : 0.9888
Specificity : 0.8868
Pos Pred Value : 0.9362
Neg Pred Value : 0.9792
Prevalence : 0.6268
Detection Rate : 0.6197
Detection Prevalence : 0.6620
Balanced Accuracy : 0.9378
'Positive' Class : B
Random Forests
The caret API is so cool that you can use it for lots of models. You'll build random forests below. Before describing random forests, you'll need to know a bit about decision tree classifiers. Decision trees allow you to predict the class of a data point (the target variable, for example, benign or malignant tumour) based on feature variables (such as the geometric measurements of the tumour). See here for an example. The depth of the tree is known as a hyperparameter, which means a parameter you need to decide before you fit the model to the data. You can read more about decision trees here.
A random forest is a collection of decision trees, each fit to a different subset of the data, that vote on the label of each data point. This provides some intuition behind random forests, and you can find more technical details here.
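To see a single decision tree in action before combining many of them, here is a minimal sketch using the rpart package (rpart is not part of this lesson, so treat this as an optional aside):
# Work on a copy so the lesson's df is untouched; drop the baseline pred column if you created it
library(rpart)
df_tree <- df
df_tree$pred <- NULL
# Fit one classification tree to the data
tree <- rpart(X2 ~ ., data = df_tree, method = "class")
# Predicted classes from the single tree, and how often they match the true labels
tree_pred <- predict(tree, df_tree, type = "class")
mean(tree_pred == df_tree$X2)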
Before you build your first random forest, there's a pretty cool alternative to train/test split called k-fold cross validation that we'll look into.
Cross Validation
To choose a value for a random forest hyperparameter (max_depth, for example, or mtry, which is what caret tunes below), you'll use a variation on train/test split called cross validation.
We begin by splitting the dataset into 5 groups or folds (see here, for example). Then we hold out the first fold as a test set, fit our model on the remaining four folds, predict on the test set and compute the metric of interest. Next we hold out the second fold as our test set, fit on the remaining data, predict on the test set and compute the metric of interest. Then similarly with the third, fourth and fifth.
As a result we get five values of accuracy, from which we can compute statistics of interest, such as the median and/or mean and 95% confidence intervals.
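As a rough sketch of what this looks like by hand (caret automates all of this in the next section; createFolds() is a caret helper, and logistic regression is just a stand-in model here):
set.seed(42)
# Split the row indices into 5 folds, stratified by the target variable
folds <- createFolds(df$X2, k = 5)
# For each fold: train on the other four folds, predict on the held-out fold
accuracies <- sapply(folds, function(test_idx) {
  fit <- glm(X2 ~ ., family = "binomial", data = df[-test_idx, ])
  p <- predict(fit, df[test_idx, ], type = "response")
  pred <- ifelse(p > 0.5, "M", "B")
  mean(pred == df[test_idx, ]$X2)
})
# One accuracy per fold, plus their mean
accuracies
mean(accuracies)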
We do this for each value of each hyperparameter that we’re tuning and choose the set of hyperparameters that performs the best. This is called grid search if we specify the hyperparameter values we wish to try, and called random search if we search randomly through the hyperparameter space (see more here).
You'll first build a random forest with a grid containing a single hyperparameter value, to get a feel for it.
# Create model with default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
mtry <- sqrt(ncol(df))
tunegrid <- expand.grid(.mtry=mtry)
rf_default <- train(X2~., data=df, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
Random Forest
569 samples
31 predictor
2 classes: 'B', 'M'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 511, 513, 512, 513, 513, 512, ...
Resampling results:
Accuracy Kappa
0.9625076 0.9193644
Tuning parameter 'mtry' was held constant at a value of 5.656854
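If you wanted an actual grid search over several values of mtry rather than holding it constant, a sketch along the same lines (the particular mtry values are arbitrary):
# Grid search over a handful of mtry values
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=c(2, 4, 8, 16, 31))
rf_grid <- train(X2~., data=df, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_grid)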
Now try your hand at a random search:
# Random Search
control <- trainControl(method="repeatedcv", number=5, repeats=3, search="random")
mtry <- sqrt(ncol(df))
rf_random <- train(X2~., data=df, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
Random Forest
569 samples
31 predictor
2 classes: 'B', 'M'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times)
Summary of sample sizes: 455, 456, 455, 456, 454, 456, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9578156 0.9092135
6 0.9619042 0.9180916
12 0.9613194 0.9168014
15 0.9607345 0.9154082
16 0.9619092 0.9180466
18 0.9625044 0.9193323
19 0.9607242 0.9155330
21 0.9624992 0.9192701
23 0.9577897 0.9091203
24 0.9601598 0.9143988
25 0.9589697 0.9117203
28 0.9601341 0.9142834
30 0.9589696 0.9118957
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 18.
And plot the results:
plot(rf_random)
Key Points
The target variable is the variable of interest, while the rest of the variables are known as features or predictor variables.
Separate your data set into training and test sets so that you can assess how well your model generalizes to unseen data (and detect overfitting).
Logistic regression and random forests can be used to predict categorical variables.