Loading and exploring data
Overview
Teaching: 20 min
Exercises: 10 minQuestions
What is Exploratory Data Analysis (EDA) and why is it useful?
How can I do EDA in R?
Objectives
Use
caretto preprocess data.
Setting up
Make sure you have installed R and RStudio, and installed and loaded the necessary packages from the Setup section.
Loading your data
It’s time to import the first dataset that we’ll work with, the Breast Cancer Wisconsin (Diagnostic) Data Set from the UCI Machine Learning repository.
Do this and check out the first several rows:
# Load data
df <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
               col_names = FALSE)
# Check out head of dataframe
df %>% head()
# A tibble: 6 x 32
        X1 X2       X3    X4    X5    X6     X7     X8     X9    X10   X11
     <int> <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>
1   842302 M      18.0  10.4 123.  1001  0.118  0.278  0.300  0.147  0.242
2   842517 M      20.6  17.8 133.  1326  0.0847 0.0786 0.0869 0.0702 0.181
3 84300903 M      19.7  21.2 130   1203  0.110  0.160  0.197  0.128  0.207
4 84348301 M      11.4  20.4  77.6  386. 0.142  0.284  0.241  0.105  0.260
5 84358402 M      20.3  14.3 135.  1297  0.100  0.133  0.198  0.104  0.181
6   843786 M      12.4  15.7  82.6  477. 0.128  0.17   0.158  0.0809 0.209
# ... with 21 more variables: X12 <dbl>, X13 <dbl>, X14 <dbl>, X15 <dbl>,
#   X16 <dbl>, X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>,
#   X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>,
#   X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>
Discussion
What are the variables in the dataset? Follow the link to UCI above to find out.
Before thinking about modeling, have a look at your data. There’s no point in throwing a $10^4$ layer convolutional neural network (whatever that means) at your data before you even know what you’re dealing with.
You’ll first remove the first column, which is the unique identifier of each row:
# Remove first column 
df <- df[2:32]
# View head
df %>% head()
# A tibble: 6 x 31
  X2       X3    X4    X5    X6     X7     X8     X9    X10   X11    X12
  <chr> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
1 M      18.0  10.4 123.  1001  0.118  0.278  0.300  0.147  0.242 0.0787
2 M      20.6  17.8 133.  1326  0.0847 0.0786 0.0869 0.0702 0.181 0.0567
3 M      19.7  21.2 130   1203  0.110  0.160  0.197  0.128  0.207 0.0600
4 M      11.4  20.4  77.6  386. 0.142  0.284  0.241  0.105  0.260 0.0974
5 M      20.3  14.3 135.  1297  0.100  0.133  0.198  0.104  0.181 0.0588
6 M      12.4  15.7  82.6  477. 0.128  0.17   0.158  0.0809 0.209 0.0761
# ... with 20 more variables: X13 <dbl>, X14 <dbl>, X15 <dbl>, X16 <dbl>,
#   X17 <dbl>, X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>,
#   X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>, X27 <dbl>, X28 <dbl>,
#   X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>
Question
How many features are there in this dataset?
Discussion
Why did we want to remove the unique identifier?
Now there are too many features to plot so you’ll plot the first 5 in a pair-plot:
# Pair-plot of first 5 features
ggpairs(df[1:5], aes(colour=X2, alpha=0.4))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Discussion
What can you see here?
Note that the features have widely varying centers and scales (means and standard deviations) so we’ll want to center and scale them in some situations. You’ll use the caret package for this. You can read more about preprocessing with caret here.
# Center & scale data
ppv <- preProcess(df, method = c("center", "scale"))
df_tr <- predict(ppv, df)
# Summarize first 5 columns
df_tr[1:5] %>% summary()
      X2                  X3                X4                X5         
 Length:569         Min.   :-2.0279   Min.   :-2.2273   Min.   :-1.9828  
 Class :character   1st Qu.:-0.6888   1st Qu.:-0.7253   1st Qu.:-0.6913  
 Mode  :character   Median :-0.2149   Median :-0.1045   Median :-0.2358  
                    Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
                    3rd Qu.: 0.4690   3rd Qu.: 0.5837   3rd Qu.: 0.4992  
                    Max.   : 3.9678   Max.   : 4.6478   Max.   : 3.9726  
       X6         
 Min.   :-1.4532  
 1st Qu.:-0.6666  
 Median :-0.2949  
 Mean   : 0.0000  
 3rd Qu.: 0.3632  
 Max.   : 5.2459  
Now plot the centered & scaled features:
# Pair-plot of transformed data
ggpairs(df_tr[1:5], aes(colour=X2))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Discussion
How does this compare to your previous pairplot?
Key Points
Plots are always useful tools for getting to know your data.
Center and scale your numerical variables using the
caretpackage.