These are accompanying notes for a 4-part ‘Practical Statistics for Experimentalists’ Workshop taught at Yale University in the Fall of 2015, the project of which was to introduce experimentalists to statistical and data analytic methodologies and intuitions that they can immediately use in their everyday work, along with methods to implement everything learned in the R programming language. Participants were Graduate students and Postdoctoral Fellows/Associates from the Molecular Biophysics & Biochemistry Department and Yale Medical School. These notes are not intended as stand-alone resources for ‘Practical Statistics for Experimentalists’, but as supporting material for the Workshop. Having said that, they are relatively complete and give a good indication of what was covered. You will not, however, be able to run all the R code embedded in these notes without the required data sets, many of which were kindly supplied by the relevant scientists/authors. All papers/texts referenced in the body of these notes are listed in the ‘References’ section at the end. Feel free to contact me at hugobowne at gmail dot com with any questions.
Workshop I is concerned with i) plotting data, ii) summary statistics and iii) the Central Limit Theorem. Before even attempting to summarize your data or perform any sort of robust statistical analyses on it, it is important to get a feel for what your data looks like. There are many ways to plot data to do this, a number of which we shall delve into in this Workshop, while weighing up the pros and cons of each type of approach to plotting. We shall then look at how we can use summary statistics to summarize data, with particular reference to measures of central tendency and measures of variability. We close with the statement and examples of the Central Limit Theorem, a powerful result that will influence much of what we do when practicing the art of statistics (for example, statistical hypothesis testing).
When we say univariate data, all we mean is that there is one variable. The 1st way to plot such data is in a histogram. But first we need some data. There are 3 ways to get data into R:
We will now go through each of these modes of getting data into R:
The authors of (Gardner et al., 2011) have generously allowed us to use their dataset for this Workshop:
A little bit of biological background: Microtubules are cytoskeletal polymers that alternate between periods of slow growth and rapid shrinkage. The conversion from the growing state to the shrinking state is called microtubule catastrophe and the time it takes from nucleation to catastrophe is called the microtubule’s lifetime. Gardner & co-authors used a TIRF assay (we won’t go into details of what this is here) to measure microtubule lifetime in vitro under 3 conditions: i) in the presence of tubulin alone (well, with buffer + ATP), ii) in the presence of tubulin and the microtubule-associated protein (MAP) MCAK & iii) in the presence of tubulin and the MAP Kip3. In i), the authors perform the assay at different tubulin concentrations and in ii) & iii) at different concentrations of the relevant MAP. The idea of the paper is to see how different MAPs affect the lifetime of microtubules. In today’s Workshop, we will look at cases i) & ii) and not worry about Kip3, for the time being.
We will first look at condition i), which we have as a .csv (“Comma Separated Values”) file. The data in this file is the lifetime (defined as the time from nucleation to catastrophe) of microtubules grown in vitro under different tubulin concentrations and each column is a different [Tb].
How we will store & visualize the data in R: We will use a visualization library called ggplot2 – there is a bit of a learning curve in figuring out how to use it but once you’ve got it going on, it can be used to create beautiful figures with ease. It is based around Leland Wilkinson’s Grammar of Graphics and many publications, such as the New York Times, use it. There is a great cheat sheet for ggplot2 that you should check out. In order to use ggplot2 to its fullest capabilities, we will store our data as data frame objects – we will play around with these a great deal in these Workshops – all you need to know now is that ‘[a] data frame is a list of variables of the same number of rows with unique row names’, according to R’s documentation (if this definition leaves the concept still opaque, we’ll figure out below what a data frame is by looking at one!).
library( ggplot2 ) #load ggplot2 library for handsome plotting structures
setwd("~/Documents/practical_statistics_hba/")#You need to set your working directory
#in order to access the files/data necessary -- we'll discuss this.
data <- read.csv("data/lifetimes_tubulin.csv" ) #load data from .csv
##this is a dataframe -- a very important structure in R
colnames(data) #print the column names in the dataframe 'data'
## [1] "X12_uM" "X7_uM" "X9_uM" "X10_uM" "X14_uM"
x <- data$X12_uM #assign a single column of df to the variable 'x'
qplot( x , xlab = "MT lifetime at 12uM") #plot a histogram of the data at 12uM
Important Note: The help() function in R will be one of your best friends; if, for example, you don’t know what ‘qplot’ does, execute the following in your console:
help(qplot)
Note: The column names of the data frame have been adjusted by R to begin with an ‘X’ so that they are syntactically valid (in this case, so as to not begin with a numeric value).
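If you’re curious, this renaming is done (via read.csv’s check.names argument) by base R’s make.names() function, which you can call directly to see what any given column name will be turned into (the names below are just illustrative):
make.names( c("12_uM" , "7_uM" , "lifetime") ) #see how R makes names syntactically valid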
Exercise 1 (~15 minutes):
Notes:
When you install R, it ships with a plethora of datasets built into it. We will look at the mtcars dataset. For a list of other available inbuilt datasets, execute the following command in your console:
data()
We will now look at the mtcars dataset:
cars <- mtcars #pre-existing dataset in R
p1 <- ggplot(data = cars , aes(x=mpg)) #initialize plot to look at miles per gallon
p1 + geom_histogram( binwidth = 1 ) #plot histogram & play around with binwidth if desired
I know that this section is supposed to focus on plotting histograms of univariate data BUT ggplot2 makes plotting bivariate data so easy that I couldn’t help including a little bit here:
p2 <- ggplot( data = cars , aes( x = mpg , y = wt)) #initialize plot to look at the relationship
#between variables (bivariate) miles /gallon & weight (lb/1000)
p2 + geom_point() #plot a scatterplot
#plot scatter of mpg vs weight; plot linear fit + 95% confidence bands:
p2 + geom_point() + stat_smooth( method = "lm") + xlab("miles per gallon") + ylab("weight (lb/1000)")
R contains inbuilt random number generators that we can use to draw random numbers from, for example, uniform and Gaussian distributions. There are two points to note:
set.seed(42) # set the seed of your PRNG
x <- rnorm(1000 , mean = 0 , sd = 1) ##generate 1000 data points from a normal
#distribution with mean = 0, sd = 1
qplot( x ) #plot a histogram of the data
We can also produce a relative frequency plot as follows:
df <- as.data.frame(x) #store Gaussian distributed data in a dataframe
m <- ggplot(df , aes(x)) #initiate plotting structure
m + geom_histogram(aes(y = ..density..)) #plot relative frequency plot
Exercise 2 (~10 minutes):
Note: I haven’t told you the functions used to generate these datasets. You’ll need to find them somehow! A huge part of coding is finding the right function. Hint: search engines are your friends.
Pros & Cons of Histograms: Histograms are great visual tools because they allow you to see immediately how your variable is distributed; they should, however, be taken with a grain of salt, because what you see is also a function of your choice of bin size.
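To see this sensitivity for yourself, here is a quick sketch (reusing the Gaussian sample x generated above) that plots the same data with two different binwidths -- note how different the two pictures look:
qplot( x , binwidth = 0.1 ) #histogram of the data with narrow bins
qplot( x , binwidth = 1 ) #histogram of the same data with wide bins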
The ECDF
\[F(x) = \frac{\text{number of data points} \leq x}{n},\]
where \(n\) is the total number of data points.
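Base R also has an ecdf() function that constructs this step function directly from the definition; here is a minimal sketch on the 12uM lifetime data (dropping NA’s first; the value 400 is just an arbitrary example point):
F_12 <- ecdf( na.omit(data$X12_uM) ) #build the empirical CDF of the 12uM lifetimes
F_12( 400 ) #the fraction of lifetimes <= 400, as in the formula above
plot( F_12 ) #base R plot of the ECDF, for comparison with the ggplot2 version below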
Let’s first have a look at the ECDF for some of the microtubule lifetime dataset:
ggplot(data, aes(x = X12_uM)) + stat_ecdf() + xlab("Microtubule lifetime") +
ylab("ECDF")#plot ECDF of the data
Pros & Cons of CDFs:
Exercise 3 (~5 minutes):
Box plots are a great way to visualize where most of the data lies. In a box plot, the top and bottom boundaries of the box are the 3rd and 1st quartiles of the data, respectively, the line in the middle is the median, and the circular data points are the outliers (all of these are to be defined precisely below). We now plot the microtubule lifetime data as a set of box plots with [Tb] on the x-axis.
However, to do so, we want to have the microtubule lifetime data in a different form – we would like a data frame in which each row is a single experiment with 2 columns: (i) the recorded lifetime & (ii) the tubulin concentration of the experiment. There is a great package ‘tidyr’ that helps you reshape data frames: two important functions are ‘gather()’ and ‘spread()’, which make ‘wide’ data ‘long’ and ‘long’ data ‘wide’, respectively – we’ll see precisely what this means in a minute – for more information see here.
library(tidyr) #this library contains the function that we'll need in order to reshape the dataframe
#as desired
mt_data <- gather( data ) #gather data as described above
colnames(mt_data) <- c("tubulin_concentration" , "lifetime") #rename the columns
mt_bp <- ggplot( mt_data , aes( x = factor(tubulin_concentration) , y = lifetime)) #initiate plot
mt_bp + geom_boxplot() + xlab("[Tb]") #plot the box plot
## Warning: Removed 1540 rows containing non-finite values (stat_boxplot).
Notes:
Pros & Cons of box plots:
For small amounts of data, you should plot all the data, not just the box plot (which summarizes aspects of the data). A column scatter plot (with jitter) is one way to achieve this and you can even layer it on top of a box plot:
#data from here: http://bl.ocks.org/mbostock/4061502#morley.csv
datamm <- read.csv("data/mm.csv")
p <- ggplot( datamm , aes( factor(Expt) , Speed ))
p + geom_boxplot() + geom_jitter(aes(colour = factor(Expt))) + guides(colour=guide_legend(title="Expt")) +
xlab("Experiment") +ylab("Speed of Light (km/s minus 299,000)")
Note: the data here is from the famed Michelson-Morley experiment – in which the data is the speed of light from different experiments.
Question: What is a problem with the way in which I have plotted both the box plot and the jitter plot above?
Pros & Cons of scatter plots:
If you have more data, it may overlap in a jitter plot. One way to avoid this is to use a beeswarm plot, in which the data is cleverly spread out. We will not use ggplot2 for this, as to do so gets too technical, but if you’re interested, see here.
library(beeswarm) #load beeswarm library
data(breast)# load data of '[t]umor molecular measurements and outcome from breast cancer patients'
#create beeswarm plot below:
beeswarm(time_survival ~ ER, data = breast,
pch = 16, pwcol = 1 + as.numeric(event_survival),
xlab = "", ylab = "Follow-up time (months)",
labels = c("ER neg", "ER pos"))
legend("topright", legend = c("Yes", "No"),
title = "Censored", pch = 16, col = 1:2)
In order to see more about the dataset used above, execute the following command:
help(breast)
Exercise 4 (~10 minutes): create a beeswarm plot of the microtubule lifetime data with (i) tubulin concentration on the x-axis & (ii) lifetime on the y-axis.
Question: Is there too much, too little or the correct amount of data for a beeswarm plot here?
Pros & Cons of beeswarm plots:
How can we numerically describe & summarize data? We use statistics: but what is a statistic? To liberally paraphrase Phil Gregory (in Bayesian Logical Data Analysis for the Physical Sciences), a statistic is ‘a function that maps any data set to a numerical measure’ (what Gregory actually wrote is that a statistic is ‘any function of the observed random variables in a sample such that the function does not contain any unknown quantities’!). We use summary statistics to easily identify attributes such as central tendency and variability, also called dispersion. We shall explore these now.
The mean of a dataset \(D=\{x_i\}\) is the average \(\mu = \frac{1}{n}\sum\limits_i x_i\). This is a very intuitive statistic & also one of the few statistics that most lay-people recognize as a quantity-of-interest: but why the hell is it of such import? Is it arbitrary? The short answer is ‘no’ and the long answer is too long. The easiest way to quantitatively motivate it here is that the mean is the number that is, on average, closest to all of the data, that is, it minimizes the squared error sum (you can prove this with a bit of calculus & algebra!):
\[p(x) = \sum\limits_i(x - x_i)^2.\]
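If you’d rather convince yourself of this numerically than with calculus, here is a minimal sketch (on some made-up Gaussian data) that evaluates the squared error sum over a grid of candidate values and checks that the minimum sits at the mean:
z <- rnorm(100 , mean = 5 , sd = 2) #generate some data
sse <- function(m) sum( (m - z)^2 ) #the squared error sum p(m) for a candidate value m
grid <- seq( min(z) , max(z) , length.out = 1000) #a grid of candidate values
grid[ which.min( sapply(grid , sse) ) ] #the candidate minimizing the squared error sum...
mean(z) #...agrees (approximately) with the mean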
To calculate the mean of a dataset, we use the function ‘mean’ in R:
#data from here: http://bl.ocks.org/mbostock/4061502#morley.csv
datamm <- read.csv("data/mm.csv") #load data
datamm_ep1 <- subset(datamm , Expt == 1) ##what have I done here?
mean_exp1 <- mean( datamm_ep1$Speed ) #compute the mean
print( mean_exp1) #print the mean
## [1] 909
You can also calculate the mean of each column of a data frame by using the ‘apply’ function:
apply( data , 2 , mean) #compute the mean of the columns of 'data'
## X12_uM X7_uM X9_uM X10_uM X14_uM
## 380.5538 NA NA NA NA
Question: Whoa!! This reported NA’s. Why did this happen? How can we avoid it?
Answer & Code:
#for more on 'apply' and its variants, see here: #https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
#compute the mean of the columns of 'data', ignoring NA's:
apply( data , 2 , function(x) mean(x,na.rm = TRUE))
## X12_uM X7_uM X9_uM X10_uM X14_uM
## 380.5538 323.6842 305.2549 355.5804 468.5106
Exercise 5 (~10 minutes):
Compute the mean across all experiments in the Michelson-Morley dataset. There are two methods that I would suggest trying:
Note: the mean is not always a good statistic, particularly for skewed and bimodal distributions (more to come on this) or datasets with outliers.
Here are 3 cases in which the mean is not a great summary statistic to report:
Question: what have the outliers done to the mean in the above?
data from a distribution with 2 peaks (e.g. heights of multiple genders or ethnicities; see here and here for information on the height data);
data from a skewed distribution (e.g. gamma distribution – arises from multistep processes) – in the example below, the mean is ~2 and the peak of the histogram occurs at ~1:
x <- rgamma(10000 , 2 ,1) #generate data from a skewed gamma distribution
qplot(x) #plot the data
mean(x) #compute the mean of the data
## [1] 2.021093
Dealing with these issues: To deal with the problem of outliers, statisticians like to look at the median rather than the mean; if you’re actually interested in the most likely data point, that is, where the highest peak is in the histogram, you look at the mode.
To find the median of your dataset, list all data from smallest to largest: the median is then defined to be the number that occurs in the middle of this list (or the average of the middle two, if there are an even number of data points). Intuitively, the median splits the data into a top 50% and a bottom 50%.
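As a quick sanity check of this definition, here is a sketch computing the median ‘by hand’ from the sorted data and comparing it to R’s median() function (on a small made-up dataset with an odd number of points):
z <- c(7 , 1 , 4 , 2 , 9) #a small made-up dataset with an odd number of points
sort(z) #list the data from smallest to largest
sort(z)[ (length(z) + 1)/2 ] #the number in the middle of this list...
median(z) #...agrees with R's median()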
mean(x) #compute the mean
## [1] 2.021093
median(x) #compute the median
## [1] 1.678056
The median is very robust to outliers, in the sense that up to ~50% of the data can be contaminated with outliers without altering the median. For example, the datasets \(\{1,2,2,3,4,5,5,6,7\}\) & \(\{1,2,2,3,4,5,5,60,70\}\) both have a median of \(4\) and yet their means differ significantly due to the presence of outliers in the latter.
I include a demonstrative plot below. The bar in the middle of each box is the median of the dataset: you can see that the medians are the same and yet the means are very different due to the presence of outliers:
x <- c(1,2,2,3,4,5,5,6,7) #generate some data
y <- c(1,2,2,3,4,5,5,60,70)#generate some data
df <- data.frame(x,y) #combine data into a dataframe
print( c( mean(x) , mean(y)) ) #compute means
## [1] 3.888889 16.888889
print( c( median(x) , median(y)) ) #compute medians
## [1] 4 4
dfm <- gather(df) #reshape the data
mp <- ggplot(dfm , aes( x = key , y = value)) #initiate plotting structure
mp + geom_boxplot() + xlab("") #plot as a box plot
The mode is defined to be the value that is most likely to occur (i.e., where the peak in the histogram occurs). When the histogram/probability distribution function has two modes, we call it bi-modal; when it has more than two, multi-modal. We will see that the mode can be of great interest later when performing parameter estimation (for example, maximum likelihood estimation and Bayesian parameter estimation). See an example of trimodality here:
Also see here for some interesting notes on the relation between mode, median, mean & something called the Pearson mode skewness.
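R has no built-in mode function for continuous data, but one simple way to estimate the mode of a sample is to take the peak of a kernel density estimate (a smoothed version of the histogram) via density() -- here is a sketch on skewed gamma data like that used above:
z <- rgamma(10000 , 2 , 1) #skewed data, as in the gamma example above
d <- density(z) #kernel density estimate (a smoothed histogram)
d$x[ which.max(d$y) ] #location of the highest peak, i.e. an estimate of the mode (~1 here)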
The range of a dataset \(\{x_i\}\) is \(r:=x_{max} - x_{min}\), the difference between the maximum and the minimum. Although it tells us the size of the region in which all the data lies, it doesn’t tell us anything else and it is also completely sensitive to outliers.
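In R, range() returns the minimum and maximum, so computing the range as defined here is a one-liner (reusing the outlier-containing dataset y from above):
range(y) #the minimum & maximum of the data
diff( range(y) ) #r = max - min; note how completely the outliers 60 & 70 dominate this number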
The variance is a measure of how far a dataset is spread out. Recalling the mean \(\mu = \frac{1}{n}\sum\limits_i x_i\), the variance is defined to be
\[\sigma^2 = \frac{1}{n}\sum\limits_i(x_i-\mu)^2\]
and \(\sigma\), the square root of the variance, is called the standard deviation of the data.
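Here is a minimal sketch computing the variance directly from this formula and comparing it to R’s built-in var() and sd(); one detail worth knowing is that var() and sd() divide by \(n-1\) rather than \(n\) (the ‘sample’ variance), so the two calculations agree only approximately, though very closely for large \(n\):
z <- rnorm(10000 , mean = 0 , sd = 2) #generate some data with known standard deviation
mu <- mean(z) #the mean of the data
sum( (z - mu)^2 ) / length(z) #the variance, computed from the formula above (divides by n)
var(z) #R's var() divides by n - 1, so it differs very slightly
sqrt( var(z) ) #the standard deviation is the square root of the variance...
sd(z) #...which R computes directly with sd()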
Cool note: having two summary statistics, such as a measure of central tendency (e.g. mean) and a measure of dispersion (e.g. variance), can be enough to fully describe many of the most important distributions. For example, an exponential distribution is characterized by its mean alone, a normal/Gaussian distribution by its mean and variance, and a Pareto (power-law) distribution by its mean and variance.
Motivation:
Equation: \(P(x) = \frac{1}{\sigma\sqrt{2\pi}}\text{exp}(-\frac{(x - \mu)^2}{2\sigma^2}),\) where \(\mu\) and \(\sigma^2\) are the mean and variance, respectively, of the distribution.
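To connect this formula to something you can look at, here is a quick sketch of the Gaussian density for \(\mu = 0\), \(\sigma = 1\) (values chosen arbitrarily), evaluated with R’s dnorm():
z <- seq(-4 , 4 , length.out = 200) #a grid of x values
qplot( z , dnorm(z , mean = 0 , sd = 1) , geom = "line" , ylab = "P(x)") #the Gaussian density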
Motivation:
Equation: \(P(x) = \frac{\text{exp}(-x/\mu)}{\mu}\), where \(\mu\) is the mean of the distribution.
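As a sanity check of this parametrization, here is a sketch drawing samples from an exponential distribution with rexp() (note that rexp() is parametrized by the rate, which is \(1/\mu\)) and confirming that the sample mean is close to \(\mu\):
mu <- 3 #choose a mean for the exponential distribution (arbitrary value)
z <- rexp(10000 , rate = 1/mu) #draw 10,000 samples
qplot(z) #plot a histogram -- note the characteristic exponential decay
mean(z) #should be close to mu = 3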
Key: variance and standard deviation are sensitive to outliers (see here, for example). Also, standard deviation can give us insight into the existence of outliers:
x <- c(1,2,2,3,4,5,5,6,7) #generate some data
y <- c(1,2,2,3,4,5,5,60,70)#generate some data
df <- data.frame(x,y) #combine data in dataframe
print( apply( df , 2 , mean) ) #compute means
## x y
## 3.888889 16.888889
print( apply( df , 2 , sd) ) #compute standard deviations
## x y
## 2.027588 27.424644
Question: How can we see from the mean & the standard deviation of the above datasets that one of them may contain outliers?
Exercise 6 (~10 minutes): Compute the standard deviation of one of the microtubule lifetime datasets (for all concentrations) OR the Michelson-Morley data (all experiments).
Recall that the median splits the data into a top 50% and a bottom 50%. We can similarly split these regions into their top and bottom halves: \(Q_1\), the 1st quartile, is defined to be the median of the bottom 50% (i.e. the median of the data consisting of all \(x_i\) less than the median) & \(Q_3\), the 3rd quartile, is defined to be the median of the top 50% of the data. We then define the interquartile range to be \(IQR = Q_3 - Q_1\), a measure of how spread out the data is (and it is a robust one, particularly with respect to outliers!).
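In R, quantile() reports the quartiles (among other quantiles) and IQR() computes the interquartile range directly; here is a minimal sketch on the 12uM lifetime data (R actually has several slightly different conventions for computing quantiles, but the defaults are fine for our purposes):
quantile( data$X12_uM , na.rm = TRUE ) #the minimum, Q1, median, Q3 & maximum
IQR( data$X12_uM , na.rm = TRUE ) #the interquartile range Q3 - Q1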
Having defined the median and quartiles, we can now define a box plot (also known as a box-and-whisker diagram), which provides a very useful way of visualizing data. In a box plot,
Here are box plots from the famed Michelson-Morley experiment, in which the data is the speed of light in different experiments: There are also bivariate analogs of box plots, called bag plots. Check them out here.
Let’s say we have drawn \(n\) random, independent samples \(X_1, X_2, \ldots , X_n\) from a probability distribution \(P\), where \(P\) has mean \(\mu\) and variance \(\sigma^2\). Now the sample mean \(\bar{x}\) provides an estimate of the population mean \(\mu\). The question is “how good is the estimate \(\bar{x}\) of \(\mu\)?” Another way to ask this is as follows: “if I performed the same experiment a number of times, that is, drew another \(n\) samples and calculated the sample mean, what would be the standard deviation of all of these sample means?” The answer is that the standard error of the sample mean is
\[SE_{\bar{x}} = \frac{s}{\sqrt{n}},\]
where \(n\) is the sample size and \(s\) is the sample standard deviation. This follows, in fact, from the Central Limit Theorem, which we state below. First note, though, that there are diminishing returns of the number of data points with respect to the standard error, as it decreases as the inverse of \(\sqrt{n}\), so to be 10 times as certain in your estimate of the mean, you actually require 100 times as many data points!
Example of computation:
data <- read.csv("data/lifetimes_tubulin.csv") #load data from .csv
slt <- apply( data , 2 , function(x) sd(x , na.rm = TRUE)) #standard deviation of lifetimes
n <- apply( data , 2 , function(x) sum(!is.na(x))) #number of data points
sem <- slt/sqrt(n) #standard errors
print(sem)
## X12_uM X7_uM X9_uM X10_uM X14_uM
## 8.792836 8.802979 11.485072 13.121386 21.743109
Exercise 7 (~5 minutes): Figure out what each line of the code above does.
When do I report the standard deviation and when do I report the standard error? A general rule is as follows: if you are interested in seeing and/or reporting how spread out your data is, state or visualize the standard deviation; if, on the other hand, you are interested in making inferences about the data, such as the value of the population mean or the statistical significance of an effect, report the standard error of the mean. See (Cumming et al., 2007) for further reading. Also see this paper and this paper.
The Central Limit Theorem, in essence, tells us how the sample mean, as an estimator, is itself distributed, in the limit of large samples:
Let \(X_1, X_2, \ldots X_n\) be a random sample drawn from a probability distribution \(P\) (with mean \(\mu\) and variance \(\sigma^2\)). The sample average is defined as
\[S_n = \frac{X_1 + X_2 + \ldots + X_n}{n}\]
and the Central Limit Theorem states that, as \(n \to \infty\), \(S_n \to N(\mu,\sigma^2/n)\). We will demonstrate part of its power when dealing with Student’s t-test in Workshop II. For the time being, let’s look at what Francis Galton (Victorian statistician, progressive, polymath, sociologist, psychologist, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist and psychometrician) had to say about the CLT:
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The larger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.
##Here we perform a simple demonstration of the Central Limit Theorem
##Random Variable defined by p(0)=p(1)=0.5. Draw n samples (e.g. n = 1000)
##& retrieve the mean. Do this m times (e.g. m = 10,000) and plot
##the distribution of means:
n <- 1000 #number of samples for each trial
m <- 10000 #number of trials
x <- rbinom( m , n , 0.5 )/n # draw m binomial(n, 0.5) counts & divide by n to get m sample means
qplot(x , binwidth = 0.005) #plot histogram of means
This example will take the form of Exercise 8 (~15 minutes): i) load the \(12\mu M\) tubulin microtubule lifetime dataset; ii) compute the mean of a random subset of size \(n=300\); iii) do this 10,000 times and plot the resulting histogram.
This is an extra example, for those who wish to jump into some Monte Carlo simulations after class; herein, we estimate \(\pi\) using a Monte Carlo method:
##Here is an extra example, which also gives another look at the distribution of an estimator:
##we estimate pi (the ratio of a circle's circumference to its diameter) using a Monte Carlo method.
##The method, as implemented below, is to drop n points uniformly in the unit square [0,1] x [0,1],
##then count how many fall inside the quarter circle of radius 1 centred at the origin.
##As area(unit square) = 1 & area(quarter circle) = pi/4,
##an estimate for pi is 4*(number of points in the quarter circle)/n.
##We then use this procedure to estimate pi m times and look at the statistics & distribution
##of these estimates.
n <- 2000 #number of points to drop in the square
m <- 1000 #number of trials
#we write a small function here to approximate pi by the method described above
approx_pi <- function(n){
points <- replicate(2, runif(n)) #generate points in the square
distances <- apply( points , 1 , function(x) sqrt(x[1]^2 + x[2]^2)) #compute distance of each point from the origin
#see here for more on 'apply':
#https://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
points_in_circle <- sum(distances < 1) #count points in circle
pi_approx <- 4*points_in_circle/n #approximate pi
return(pi_approx) #function output designated here
}
x <- replicate(m,approx_pi(n)) #approximate pi m times
pi_mean <- mean(x) #compute the mean of your estimates
qplot( x , binwidth = 0.01) #plot distribution of estimates