Loading and exploring data

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • What is Exploratory Data Analysis (EDA) and why is it useful?

  • How can I do EDA in R?

Objectives
  • Use caret to preprocess data.

Setting up

Make sure you have installed R and RStudio, and installed and loaded the necessary packages from the Setup section.

Loading your data

It’s time to import the first dataset that we’ll work with, the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning repository.

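Load the data with read_csv() and check out the first several rows. The file has no header row, so col_names = FALSE makes read_csv() generate the column names X1 to X32: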

# Load data
df <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
               col_names = FALSE)
# Check out head of dataframe
df %>% head()
# A tibble: 6 x 32
        X1 X2       X3    X4    X5    X6     X7     X8     X9    X10   X11
     <int> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
1   842302 M      18.0  10.4 123.  1001  0.118  0.278  0.300  0.147  0.242
2   842517 M      20.6  17.8 133.  1326  0.0847 0.0786 0.0869 0.0702 0.181
3 84300903 M      19.7  21.2 130   1203  0.110  0.160  0.197  0.128  0.207
4 84348301 M      11.4  20.4  77.6  386. 0.142  0.284  0.241  0.105  0.260
5 84358402 M      20.3  14.3 135.  1297  0.100  0.133  0.198  0.104  0.181
6   843786 M      12.4  15.7  82.6  477. 0.128  0.17   0.158  0.0809 0.209
# ... with 21 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>,
#   X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>,
#   X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>,
#   X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>

Discussion

What are the variables in the dataset? Follow the link to UCI above to find out.
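Once you have read the UCI description, you may prefer descriptive column names to X1 through X32. The following is a minimal sketch, assuming the field order documented at UCI: an ID, the diagnosis, then ten measurements reported as mean, standard error, and "worst" value. The names themselves (radius_mean and so on) are made up for this illustration and are not part of the data file.

```r
# Ten measurements per cell nucleus, in the order given in the UCI description
measurements <- c("radius", "texture", "perimeter", "area", "smoothness",
                  "compactness", "concavity", "concave_points", "symmetry",
                  "fractal_dimension")

# Each measurement is reported three times: mean, standard error, "worst"
statistics <- c("mean", "se", "worst")

# All ten means come first, then the ten standard errors, then the ten worst values
feature_names <- paste(rep(measurements, times = 3),
                       rep(statistics, each = 10),
                       sep = "_")

# Hypothetical names for all 32 columns of the raw file
wdbc_names <- c("id", "diagnosis", feature_names)
length(wdbc_names)

# To apply them to the raw 32-column data frame:
# names(df) <- wdbc_names
```

The rest of this episode refers to columns by their generated names (for example X2), so renaming is left commented out here.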

Before thinking about modeling, have a look at your data. There’s no point in throwing a $10^4$-layer convolutional neural network (whatever that means) at your data before you even know what you’re dealing with.
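A first look does not need plots. The sketch below runs a few base R checks on a small toy data frame that stands in for the real data; the toy values and the variable names toy, class_counts, and missing_per_column are assumptions made for illustration. The same calls work on df.

```r
# Toy stand-in for the real data: a label column plus two numeric features
toy <- data.frame(
  X2 = c("M", "M", "B", "B", "B"),
  X3 = c(18.0, 20.6, 11.4, 12.4, 13.1),
  X4 = c(10.4, 17.8, 20.4, 15.7, NA),
  stringsAsFactors = FALSE
)

# Number of rows and columns
dim(toy)

# Type of each column and its first few values
str(toy)

# How many rows fall into each class
class_counts <- table(toy$X2)
class_counts

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```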

Start by removing the first column, which holds the unique identifier of each row:

# Remove first column (the row identifier), keeping columns 2 to 32
df <- df[2:32]
# View head
df %>% head()
# A tibble: 6 x 31
  X2       X3    X4    X5    X6     X7     X8     X9    X10   X11    X12
  <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 M      18.0  10.4 123.  1001  0.118  0.278  0.300  0.147  0.242 0.0787
2 M      20.6  17.8 133.  1326  0.0847 0.0786 0.0869 0.0702 0.181 0.0567
3 M      19.7  21.2 130   1203  0.110  0.160  0.197  0.128  0.207 0.0600
4 M      11.4  20.4  77.6  386. 0.142  0.284  0.241  0.105  0.260 0.0974
5 M      20.3  14.3 135.  1297  0.100  0.133  0.198  0.104  0.181 0.0588
6 M      12.4  15.7  82.6  477. 0.128  0.17   0.158  0.0809 0.209 0.0761
# ... with 20 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>,
#   X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>,
#   X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>,
#   X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>
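Selecting by position works, but it depends on knowing the column count. As an alternative sketch, the identifier can be dropped by name. The toy data frame below is an assumption for illustration and reuses the first two rows shown above.

```r
# Toy stand-in with an ID column, a label, and one feature
toy <- data.frame(
  X1 = c(842302, 842517),
  X2 = c("M", "M"),
  X3 = c(18.0, 20.6),
  stringsAsFactors = FALSE
)

# Keep every column whose name is not "X1"
toy_no_id <- toy[, names(toy) != "X1", drop = FALSE]
names(toy_no_id)

# With dplyr loaded, an equivalent is: toy %>% select(-X1)
```

Dropping by name gives the same result if the line is run a second time, because there is no longer an X1 column to remove.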

Question

How many features are there in this dataset?

Discussion

Why did we want to remove the unique identifier?

There are too many features to plot at once, so you’ll plot only the first 5 columns (the diagnosis, X2, and the first four features) in a pair-plot, coloured by diagnosis:

# Pair-plot of first 5 features
ggpairs(df[1:5], aes(colour=X2, alpha=0.4))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: pair-plot of the first five columns of df, coloured by X2]

Discussion

What can you see here?

Note that the features have widely varying centers and scales (means and standard deviations), so you’ll want to center and scale them in some situations. You’ll use the caret package for this; the caret documentation covers preprocessing in more detail.

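# Center & scale data
# preProcess() estimates each numeric column's mean and standard deviation;
# non-numeric columns such as X2 are left untouched
ppv <- preProcess(df, method = c("center", "scale"))
# predict() applies the estimated transformation to the data
df_tr <- predict(ppv, df)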
# Summarize first 5 columns
df_tr[1:5] %>% summary()
      X2                  X3                X4                X5         
 Length:569         Min.   :-2.0279   Min.   :-2.2273   Min.   :-1.9828  
 Class :character   1st Qu.:-0.6888   1st Qu.:-0.7253   1st Qu.:-0.6913  
 Mode  :character   Median :-0.2149   Median :-0.1045   Median :-0.2358  
                    Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
                    3rd Qu.: 0.4690   3rd Qu.: 0.5837   3rd Qu.: 0.4992  
                    Max.   : 3.9678   Max.   : 4.6478   Max.   : 3.9726  
       X6         
 Min.   :-1.4532  
 1st Qu.:-0.6666  
 Median :-0.2949  
 Mean   : 0.0000  
 3rd Qu.: 0.3632  
 Max.   : 5.2459  
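To see what centering and scaling does to a single column, here is a minimal base R sketch that subtracts the mean and divides by the standard deviation. It uses the six rounded X3 values from the head() output earlier as a toy vector, so the numbers will not match the summary of the full dataset.

```r
# Six rounded X3 values from the head() output above (toy input)
x <- c(18.0, 20.6, 19.7, 11.4, 20.3, 12.4)

# Center: subtract the mean. Scale: divide by the standard deviation.
x_std <- (x - mean(x)) / sd(x)

# The result has mean 0 (up to floating-point error) and standard deviation 1
mean(x_std)
sd(x_std)

# Base R's scale() performs the same calculation
all.equal(x_std, as.numeric(scale(x)))
```

This is why every numeric column in the summary above has a mean of 0 after the transformation.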

Now plot the centered & scaled features:

# Pair-plot of transformed data
ggpairs(df_tr[1:5], aes(colour=X2))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: pair-plot of the first five columns of df_tr, coloured by X2]

Discussion

How does this compare to your previous pairplot?
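One way to check your comparison numerically: centering and scaling shifts and stretches each axis, but it does not change how two features relate to each other. The sketch below, which uses the rounded X3 and X5 values from the head() output as toy vectors, suggests that the correlation is the same before and after standardizing.

```r
# Rounded X3 and X5 values from the head() output above (toy input)
x <- c(18.0, 20.6, 19.7, 11.4, 20.3, 12.4)
y <- c(123, 133, 130, 77.6, 135, 82.6)

# Correlation of the raw values
cor_raw <- cor(x, y)

# Correlation after centering and scaling both vectors
cor_std <- cor(as.numeric(scale(x)), as.numeric(scale(y)))

# The two agree up to floating-point error
all.equal(cor_raw, cor_std)
```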

Key Points

  • Plots are always useful tools for getting to know your data.

  • Center and scale your numerical variables using the caret package.