These are accompanying notes for a 4-part ‘Practical Statistics for Experimentalists’ Workshop taught at Yale University in the Fall of 2015, the project of which was to introduce experimentalists to statistical and data analytic methodologies and intuitions that they can immediately use in their everyday work, along with methods to implement everything learned in the R programming language. Participants were Graduate students and Postdoctoral Fellows/Associates from the Molecular Biophysics & Biochemistry Department and Yale Medical School. These notes are not intended as stand-alone resources for ‘Practical Statistics for Experimentalists’, but as supporting material for the Workshop. Having said that, they are relatively complete and give a good indication of what was covered. You will not, however, be able to run all the R code embedded in these notes without the required data sets, many of which were kindly supplied by the relevant scientists/authors. All papers/texts referenced in the body of these notes are listed in the ‘References’ section at the end. Feel free to contact me at hugobowne at gmail dot com with any questions.

Workshop III is concerned primarily with model parameter estimation (model fitting) and model selection. We first introduce maximum likelihood estimation (MLE), as this is a default estimation method for many working scientists and software packages. We provide a worked theoretical example and worked experimental examples from the recent cell biological literature. We provide an intuition for model selection and an example of a model selection criterion, the Akaike Information Criterion. Although this is NOT a Workshop on mathematical modeling, but rather one on parameter fitting and model selection, we provide a discussion of how to think about such modeling. We then turn our attention towards simple linear regression, which is in fact a special case of MLE! We provide examples, along with an in-depth discussion of what assumptions underlie linear regression and when you should and should not use it. We discuss bootstrapping to obtain confidence intervals on parameter estimates. We generalize this discussion to modeling nonlinear relationships using nonlinear least squares. We promote viewing this landscape of data analytic techniques as a nested sequence as follows:

\[\text{linear regression} \subset \text{nonlinear least squares} \subset \text{maximum likelihood estimation}.\]

1 Maximum likelihood estimation

Motivation: Let’s say that we have a mathematical model and some data that we wish to fit the model to: all we mean here by ‘fitting’ is finding good estimates for the parameters in the model. Maximum likelihood estimation is a common and widely applicable method of such parameter estimation. We will see, for example, that simple linear regression is a special case of maximum likelihood estimation, as is nonlinear least squares.

1.1 Example 1: a coin-flip

Let’s say that we want to figure out the proportion of cases in which a mutation has a particular effect: for example, how often a mutant dies.

If we think of survival as heads and death as tails, we can model this as a biased coin flip with \(P(H) = p\) and \(P(T) = 1 - p\).

Set-up: I flip a biased coin \(n\) times and retrieve \(k\) heads and \(n-k\) tails.

Questions:

  1. What is my intuitive estimate of the probability of heads \(P(H)=p\)?
  2. How certain can I be of this estimate? In other words, what type of confidence intervals can I place on it?

An intuitive answer to the 1st question is the best estimate \(\hat{p} = k/n\), the number of heads divided by the total number of coin tosses (note that \(\hat{}\) denotes the best estimate). We will now see why.

1.1.1 Estimating the binomial probability \(p\)

The binomial distribution

Recall that, if the probability of heads \(P(H)=p\), then the probability of getting \(k\) heads out of \(n\) coin flips is given by the binomial distribution

\[P(k \text{ heads }| n \text{ tosses }, p) = {n\choose k}p^k(1 - p)^{n-k}.\]

In the above \(p^k\) is the probability of getting \(k\) heads, \((1-p)^{n-k}\) that of getting \(n-k\) tails and \({n \choose k}=\frac{n!}{k!(n-k)!}\) the number of ways to arrange \(k\) heads and \((n-k)\) tails.

What does maximum likelihood mean?

Now in our case we don’t want the probability of flipping a certain number of heads, given \(p\), we want the likelihood that \(p(H)=p\), given a certain number of heads! To this end, we define the likelihood of any given \(p\) given the data \(D\) to be the probability of that data \(D\), given \(p(H) = p\):

\[\mathcal{L}(p | k \text{ heads}) := P(k \text{ heads }| n \text{ tosses }, p) = {n\choose k}p^k(1 - p)^{n-k}.\]

The maximum likelihood estimate (MLE) \(\hat{p}\) of \(p\) is the value that maximizes the likelihood function \(\mathcal{L}(p | k \text{ heads})\). A little bit of calculus & algebra later (see Appendix) reveals that

\[\hat{p} = \frac{k}{n},\]

precisely as our intuition told us.

Computational Note: Because the likelihood is usually a product of many small probabilities (and hence numerically tiny), instead of maximizing the likelihood, we usually minimize the negative log-likelihood, denoted by \(-\mathcal{LL}\) (more on this below). We will need to recall our basic log laws for this purpose:

A couple of important log laws:

  1. \(\text{log}(xy) = \text{log}(x) + \text{log}(y)\);
  2. \(\text{log}(x^a) = a\text{log}(x)\);

Example: In the case of the binomial distribution above, the negative log-likelihood is

\[-\mathcal{LL} = -\text{log}{n\choose k} - k\text{log}p - (n-k)\text{log}(1-p).\]

1.1.2 How confident can we be of our estimate \(\hat{p}=\frac{k}{n}\)?

Now how confident can we be of this estimate? Our intuition is merely qualitative when it comes to this question: all we can really say is that the more data we have, the more we can be sure of our estimate. To quantitate this intuition, we use the Central Limit Theorem.

Discussion point: as we saw in Workshop I when discussing the Central Limit Theorem, this idea of confidence in our estimate is related to the variance of the results we would see when performing the experiment many times. Discuss this.

1.1.3 The CLT and confidence intervals on a biased coin flip

First notice that \(\hat{p}=k/n\) is also \(S_n\), an estimate of the mean of the distribution given by \(P(1)=p, P(0)=1-p\) (that is, we call a head 1 and a tail 0). Then the Central Limit Theorem tells us that as \(n\) gets large (in practice, \(n>20\) should be fine), \(\hat{p}=S_n\) is distributed like \(N(\mu,\sigma^2/n),\) where \(\mu = p\) and \(\sigma^2 = p(1-p)\) are the mean and variance, respectively, of this distribution. Hence \(\hat{p}\) is ~normally distributed with mean \(p\) and standard deviation \(\sqrt{\frac{p(1-p)}{n}}\).

Note: the standard deviation of this estimate is a maximum when \(p=0.5\), that is, the largest uncertainty occurs when the coin is fair! This is intuitive in the sense that, if \(p=1\), we could be fairly certain that it is biased with few coin flips as it would be coming up heads all of the time.

Moreover, as \(\hat{p}\) is normally distributed and recalling that \(95\%\) of the area under a normal curve lies within \(\approx 1.96\) standard deviations from the mean, \(95\%\) confidence intervals on the estimate \(\hat{p}\) are given by \[\hat{p}\pm 1.96\times\text{SD}(\hat{p}).\] When \(p = 0.5\), this is \(\approx \frac{k}{n} \pm \frac{1}{\sqrt{n}}\) and when \(n=400\), for example, this is \(\approx 0.5 \pm 0.05\). If, on the other hand, \(P(H)=0.75\), \(1.96\times\text{SD}\approx 1/\sqrt{4n/3}\) and to get the same size \(95\%\) confidence intervals, we only require 300 tosses of the coin.
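
As a quick numerical illustration of these formulas, here is a minimal sketch in R (the counts \(k\) and \(n\) are made-up values for illustration):

n <- 400 #number of coin tosses (made-up value)
k <- 190 #number of heads observed (made-up value)
p_hat <- k/n #estimate of P(H)
se <- sqrt( p_hat*(1 - p_hat)/n ) #standard deviation of the estimate
c( p_hat - 1.96*se , p_hat + 1.96*se ) #approximate 95% confidence interval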

Computation example: performing MLE in R.

In order to use R to compute the MLE, we just need to give it i) the data & ii) the function we wish to minimize. As a sanity check, we now generate some binomial data and use R to compute the MLE.

Essential knowledge: if there are \(n\) parameters \(p_1,\ldots,p_n\) in a model and the data \(D = \{D_i\}\) are independent, then the negative log-likelihood

\[-\mathcal{LL}(p_1,\ldots,p_n|D) = -\sum_i \text{log}P(D_i|p_1,\ldots,p_n)\]

and this is the function that we minimize in R as below, after this note.

Note: Independence of data is required so that \(P(D|p_1,\ldots,p_n) = \prod_i P(D_i|p_1,\ldots,p_n)\).

library(stats4) #load library
p <- 0.4 #set the binomial probability
n <- 5000 #number of coin flips
x <- rbinom( n , 1 , p ) # results of n coin flips
#define log likelihood function below:
LL <- function( p ){
  R <- dbinom( x , 1, p , log = TRUE) #binomial function w/ probability p
  #NOTE! in the above, log = TRUE automatically takes logs for us: awesome!
  -sum(R) #-ve log likelihood
}
fit <- mle(LL , start = list(p = 0.9))
summary(fit)
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = LL, start = list(p = 0.9))
## 
## Coefficients:
##    Estimate Std. Error
## p 0.3959998 0.00691638
## 
## -2 log L: 6713.564

Exercise 1 (~5 minutes): Play around with the binomial probability \(p\) and the number of coin flips \(n\) in the above code. Make sure that the MLE algorithm is working correctly.

1.2 Example 2: Microtubule lifetimes

1.3 What is the distribution of microtubule lifetimes?

Before (Gardner et al., 2011), it was widely believed that microtubule lifetimes were exponentially distributed (but see Odde et al., 1995, 1996; Stepanova et al., 2010):

\[P(t|\mu) = \mu \text{exp}(-t\mu),\]

where \(\mu\) is the characteristic rate of catastrophe. For this reason, catastrophe was thought to be a single-step (memoryless) process (see Appendix for further details).

Gardner and co-authors hypothesized that lifetimes may actually follow gamma distributions, which are thought of as resulting from multi-step processes. A feature of gamma-distributed lifetimes (with shape parameter \(>1\)) is that young microtubules are protected: several steps must accumulate before catastrophe occurs.

#a plot of typical exponential & gamma distributions
library(ggplot2) #load the plotting library
x <- seq(0,10, by = 0.1) #generate vector for x-axis -- seq() is a great function so remember it!
y1 <- dexp(x , rate = 1) #exponential density evaluated on vector x
y2 <- dgamma(x , shape = 2) #gamma density evaluated on vector x
df <- data.frame( x = x , y1 = y1 , y2 = y2 ) #combine these vectors to give a dataframe
p <- ggplot( df , aes(x , y1) ) #initiate plotting structure
p + geom_line(aes(col = "Exponential")) + geom_line(aes(x , y2 , col = "Gamma")) +
  ylab("Probability") + labs(col = "Distribution") #plot figure of distributions

For completeness, I include here the equation for the gamma distribution. It has 2 parameters, a shape parameter \(\alpha\) & a rate parameter \(\mu\). The probability density function is given by

\[P(t|\alpha , \mu) = \frac{\mu^\alpha}{\Gamma(\alpha)}t^{\alpha-1}\text{exp}(-\mu t),\]

where \(\Gamma(\alpha)\) is the Gamma function.

Note: When \(\alpha = 1\), \(\Gamma(\alpha) = 1\) and \(P( t | 1 , \mu) = \mu\text{exp}(-\mu t)\). This is an exponential distribution.
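
A quick numerical sanity check of this identity (a minimal sketch; the values of \(t\) and \(\mu\) here are arbitrary):

t <- 2 #an arbitrary lifetime
mu <- 0.5 #an arbitrary rate
dgamma( t , shape = 1 , rate = mu ) #gamma density with shape = 1
dexp( t , rate = mu ) #exponential density -- the two values should agree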

Aim: We want to see whether tubulin lifetimes are exponentially or gamma distributed. In order to do so, we proceed in 2 steps:

  1. We perform parameter estimation (MLE) for both types of distributions;
  2. We use a criterion for model selection.

We already know how to do 1. and will soon know how to do 2. also. We now proceed to performing MLE for the exponential distribution:

1.3.1 Exponential distribution for tubulin alone

Recalling the form of the exponential distribution

\[P(t|\mu) = \mu \text{exp}(-t\mu)\]

and assuming independence of data points, the likelihood function, given data \(D=\{t_i\}_{i\in I}\), is

\[\mathcal{L}(\mu | t) = \prod_{i\in I}P(t_i | \mu) = \prod_{i\in I}\mu \text{exp}(-t_i\mu).\]

Exercise 2 (~10 minutes):

  1. Load “lifetimes_tubulin.csv” and set the variable ‘x’ to be the 1st column;
  2. I have written below a template for the negative log-likelihood function: complete the code (use the binomial log likelihood above as a reference, if necessary):
LL_exp <- function( rate ){
  R <- #code for function here
  -sum(R) #-ve log likelihood
}
  3. Fit the exponential distribution to the data using ‘mle()’ and do not forget the argument ‘start’ that initializes the optimization process;
  4. Use the ‘summary()’ function to print out the results (one possible solution sketch follows this list).
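
For reference, here is one possible solution sketch. It assumes the data file and column used elsewhere in these notes, and, as in the gamma fit below, it uses the bounded ‘L-BFGS-B’ method so that the optimizer cannot wander into negative rates; your own choices of starting value and method may differ:

data <- read.csv("data/lifetimes_tubulin.csv") #load data from .csv
x <- data$X12_uM #lifetime data at 12uM (1st column)
LL_exp <- function( rate ){
  R <- dexp( x , rate , log = TRUE) #exponential density -- log = TRUE takes logs for us
  -sum(R) #-ve log likelihood
}
fit_exp <- mle( LL_exp , start = list( rate = 0.01 ) , method = "L-BFGS-B" , lower = 0.00001) #perform the fit
summary(fit_exp) #summarize the fit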

Let’s now plot the probability density of the data with the exponential fit for an eye-test of how good the fit is:

x <- data$X12_uM #lifetime data at 12uM
df <- as.data.frame( x = x) #turn it into dataframe to plot
###in these 3 lines, we build a dataframe of the fit:
xx <- seq(0,max(x)) #x-axis
xex <- dexp(xx , rate =coef(fit_exp)) #y-axis
df1 <- data.frame( "x" = xx , "y" = xex ) #dataframe
p <- ggplot( df , aes( x = x))  #initiate plotting structure for data density
#plot density and add layer of fit on top:
p +  geom_histogram(aes(y = ..density..)) + geom_line( data = df1 , aes(x=x , y=y) , colour = "red" , size = 2)

Exercise for home (if interested!): do the same (i.e. fit an exponential distribution and plot it) for the assay in which the MAP MCAK is also present (any [MCAK]).

1.3.2 Gamma distribution for tubulin alone

We now fit a gamma distribution to the data in an analogous manner:

data <- read.csv("data/lifetimes_tubulin.csv" ) #load data from #.csv
x <- data$X12_uM #choose column
LL_gamma <- function( shape , rate ){
  R <- dgamma( x , shape , rate , log = TRUE)#gamma distribution --note log = TRUE computes the log
  -sum(R) #-ve log likelihood
}
#perform the fit:
fit_gamma <- mle( LL_gamma , start = list( shape = 1 ,rate = 1) , method = "L-BFGS-B" , lower = 0.00001)
summary(fit_gamma) #summarize the fit
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = LL_gamma, start = list(shape = 1, rate = 1), 
##     method = "L-BFGS-B", lower = 1e-05)
## 
## Coefficients:
##          Estimate   Std. Error
## shape 2.973536395 0.1433778927
## rate  0.007815743 0.0004036644
## 
## -2 log L: 9274.501

Note: In the ‘mle’ function, we have a new argument ‘lower’ – this ensures that the optimization algorithm does not try values below a certain threshold (e.g. a gamma distribution is not defined for shape \(<0\)); in order to impose such a bound, we necessarily use a slightly different algorithm (‘method’): L-BFGS-B, the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with Bounds.

Now we plot the gamma fit:

x <- data$X12_uM #lifetime data at 12uM
df <- as.data.frame( x = x) #turn it into dataframe to plot
###in these 3 lines, we build a dataframe of the fit:
xx <- seq(0,max(x)) #x-axis
xgam <- dgamma(xx , shape = coef(fit_gamma)[1] , rate =coef(fit_gamma)[2]) #y-axis
df1 <- data.frame( "x" = xx , "y" = xgam ) #dataframe
p <- ggplot( df , aes( x = x))  #initiate plotting structure for data density
#plot density and add layer of fit on top:
p +  geom_histogram(aes(y = ..density..) , binwidth = 75) +
 geom_line( data = df1 , aes(x=x , y=y) , colour = "red" , size = 2) +
  xlab("lifetime")

Plotting the CDFs (cumulative distribution functions) of the empirical data and the fitted models can help us visualize differences, similarities and how good the fits are (in a qualitative sense), for example:

P <- ecdf(x) #this is the ECDF (of the data) as a function
xax <- seq(0,max(x) , by = 1) #define the x-axis
ec <- P(xax) #ECDF as function of x-axis 
exp_fit <- pexp(xax , rate =coef(fit_exp)) #exponential CDF for y-axis
gam_fit <- pgamma(xax , shape = coef(fit_gamma)[1] , rate =coef(fit_gamma)[2]) #gamma CDF for y-axis
#build dataframe containing x-axis, ECDF, exponential & gamma CDFs for plotting
df <- data.frame( "x" = xax , "ecdf" = ec , "exponential" = exp_fit , "gamma" = gam_fit)
p <- ggplot( df , aes(x=x , y = ecdf , colour = "ECDF")) #initiate plotting structure
##now we plot it all:
p + geom_line() + geom_line(aes(y=exp_fit, colour = "Exponential")) + geom_line(aes(y=gam_fit, colour = "Gamma") ) +
  ylab("CDF") + xlab("microtubule lifetime") + 
    scale_colour_manual(name = 'Distribution', values = c("black", "red", "blue") )

Discussion: What does the above CDF plot tell you about the goodness-of-fit for the exponential and gamma models, respectively?

1.3.3 A key note on mathematical modeling

How do you know to fit an exponential distribution and then to try a gamma distribution? What are the rules of thumb for modeling this type of data? Unfortunately, this Workshop is NOT about mathematical modeling, an incredibly rich super-field of its own. However, do let me say a few things:

  1. If you are modeling times, such as microtubule lifetimes, exponential distributions are a good place to start as they are used to model single-step processes, just as gamma distributions are used to model multi-step processes. There are a number of others you could try, such as Weibull distributions;
  2. It is essential to remember that some models have mechanistic/physical motivations/interpretations and that others do not;
  3. Models that make testable predictions are essential: if you can’t do an experiment to disprove a model, then what is it worth?
  4. In the end, collaborate with a theorist/modeler/statistician: I need to stay in business!

1.3.4 Gamma distribution for tubulin & MCAK

We now fit a gamma distribution for microtubules grown in the presence of MCAK:

data <- read.csv("data/lifetimes_mcak.csv") #load data from #.csv
x <- data$X9_nM_MCAK #choose column
x <- x[!is.na(x)] #remove NAs
LL_gamma <- function( shape , rate ){
  R <- dgamma( x , shape , rate , log = TRUE) #gamma distribution -- log = TRUE already logs it
  -sum(R) #-ve log likelihood
}
#perform fit
fit_gamma_mcak <- mle( LL_gamma , start = list( shape = 1 ,rate = 1) , method = "L-BFGS-B" , lower = 0.00001)
summary(fit_gamma_mcak) #summarize fit
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = LL_gamma, start = list(shape = 1, rate = 1), 
##     method = "L-BFGS-B", lower = 1e-05)
## 
## Coefficients:
##          Estimate   Std. Error
## shape 1.318850219 0.0810288245
## rate  0.006730101 0.0004894076
## 
## -2 log L: 5062.775
df <- as.data.frame( x = x) #turn lifetime data into dataframe to plot
###in these 3 lines, we build a dataframe of the fit:
xx <- seq(0,1600) #x-axis
xex <- dgamma(xx , shape = coef(fit_gamma_mcak)[1] , rate =coef(fit_gamma_mcak)[2]) #y-axis
df1 <- data.frame( "x" = xx , "y" = xex ) #dataframe
p <- ggplot( df , aes( x = x))  #initiate plotting structure for data density
#plot density and add layer of fit on top:
p +  geom_histogram(aes(y = ..density..) , binwidth = 75) +
 geom_line( data = df1 , aes(x=x , y=y) , colour = "red" , size = 2) +
  xlab("lifetime")

Interpretation: The addition of MCAK alters the lifetimes of microtubules NOT by changing the rate parameter dramatically, BUT by reducing the shape parameter from ~3.0 to ~1.3. A common interpretation of this is that MCAK reduces a multi-step process to nearly a single-step process (see above for the intuition behind the gamma distribution). Exercise for the avid participant: figure out how Kip3 alters the lifetimes of microtubules.

To the eye, the gamma fits look better than the exponential fits: but are they? Let’s direct our attention now to quantitative methods for model selection:

2 Model selection

There are many ways to select the best model: we want one that rewards models that fit well BUT that also penalizes models with many free parameters (in the spirit of Occam’s razor & to guard against over-fitting). The Akaike information criterion is an example of such a measure.

2.1 The Akaike information criterion

The Akaike information criterion is defined as

\[\text{AIC} = -2\text{ln}(L) + 2K,\]

where \(L\) is the maximum likelihood of the model (i.e., the likelihood function evaluated at the MLE) and \(K\) the number of free parameters in the model.

To quote Wikipedia, the preferred model is the one with the minimum AIC value. Hence AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters.

Essential note: the AIC is NOT a statistical test that tells you how good a fit is, it is merely a way to compare a number of models.

Exercise 3 (~5 minutes): Use AIC to compare exponential and gamma models of MT catastrophe. Check out

help(AIC)
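
For reference, a minimal sketch of one way to do this, assuming the ‘fit_exp’ and ‘fit_gamma’ objects fitted above (the model with the smaller AIC is preferred):

AIC(fit_exp) #AIC of the exponential fit: -2 log L + 2 x 1 parameter
AIC(fit_gamma) #AIC of the gamma fit: -2 log L + 2 x 2 parameters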

Notes:

  1. You can use other criteria, such as Bayesian information criterion (BIC). BIC will generally penalize models more heavily than AIC for complexity. They should generally agree as to which model they select and, if not, this would be the time to consult a mathematical modeler and/or a statistician.
  2. We have NOT covered goodness-of-fit tests for models, such as the \(\chi^2\)-test (also see here), which you should acquaint yourself with, at the very least. Such tests can be expanded to compare models with one another.

Online resources: see here and here for some more information on AIC, BIC & MLE.

Now that we have explored both MLE & model selection, we move onto the basics of linear regression, which will turn out to be a special case of MLE.

3 Linear regression basics (simple)

The challenge: we have some data \(D=\{x_i,y_i\}\) that looks like it could be linear, with some noise:
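
(The dataframe ‘df’ used in the next chunk was generated behind the scenes; a minimal sketch of how you could synthesize similar noisy linear data, with made-up slope, intercept and noise level, is:)

set.seed(42) #set seed for reproducibility
x <- seq( 0 , 10 , by = 0.1 ) #independent variable x
y <- 2*x + 1 + rnorm( length(x) , mean = 0 , sd = 1 ) #linear signal (slope 2, intercept 1) + noise
df <- data.frame( x = x , y = y ) #dataframe for plotting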

Performing linear regression in R is super-straightforward:

p <- ggplot( df , aes( x , y )) #initiate plot
p + geom_point() + stat_smooth( method = "lm" ) #plot data w/ linear fit & 95% confidence bands

What have we done here?

We form a (linear) model \(y=ax+b\) and we want to find the coefficients \(a,b\) in the model that fit the data best, whatever that means. The method that most people know (if they know any) and the method that most computing software will implement, if not told to do otherwise, is ordinary least squares. More details will follow after you perform a simple regression.

3.1 Your turn: spindle size as a function of droplet diameter

df <- read.csv("data/spindle_size_MG.csv") #load data
#initialize plotting structure
p <- ggplot( df , aes(x = Droplet_Diameter_um , y = Spindle_Length_um)) 
p + geom_point(alpha = 0.5) #plot with an \alpha (= transparency)

An absurdly concise description of the biological system: Matt Good & co-authors put Xenopus egg extracts in different sized compartments to see whether spindle size was a function of compartment size. See the paper for more details: it’s a great system!

Exercise 4 (~10 minutes):

  1. Load the data as above;
  2. Use the ‘subset’ function to select the data such that the droplet diameter is \(<80\mu\)m;

  3. Plot the data with a linear fit (a sketch of these steps follows this list);
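
A possible sketch for these steps (the subset call here is the same one used later in these notes):

df <- read.csv("data/spindle_size_MG.csv") #load data
df1 <- subset( df , Droplet_Diameter_um < 80) #keep droplets with diameter < 80 um
p <- ggplot( df1 , aes(x = Droplet_Diameter_um , y = Spindle_Length_um)) #initiate plotting structure
p + geom_point(alpha = 0.5) + stat_smooth( method = "lm" ) #plot data with linear fit & 95% confidence bands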

Key note: Calling ‘stat_smooth’ will visualize the linear fit but will NOT output the fitted parameters (slope & intercept); to obtain these, you will need to do the following:

lin_fit <- lm( Spindle_Length_um ~ Droplet_Diameter_um , df1) #perform fit
summary(lin_fit) #summarize fit
## 
## Call:
## lm(formula = Spindle_Length_um ~ Droplet_Diameter_um, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4556  -2.7181  -0.1319   2.3076  12.7443 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         21.02319    0.65851   31.93   <2e-16 ***
## Droplet_Diameter_um  0.20405    0.01149   17.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.637 on 627 degrees of freedom
## Multiple R-squared:  0.3346, Adjusted R-squared:  0.3336 
## F-statistic: 315.3 on 1 and 627 DF,  p-value: < 2.2e-16

3.2 How linear regression works: ordinary least squares (OLS)

Recall that we form a (linear) model \(y=ax+b\) and we want to find the coefficients \(a,b\) in the model that fit the data best.

For any \(a,b\) we can compute \(\varepsilon_i^2\), the square of the distance between the y-coordinate of the data \(y_i\) and the prediction of the model \(ax_i+b\): this distance \(\varepsilon_i\) is called the \(i\)th residual and \(\varepsilon_i^2=(y_i-(ax_i+b))^2\). The sum of the squares of the residuals is given by

\[SS_{res} = \sum\limits_i\varepsilon_i^2 = \sum\limits_i(y_i-(ax_i+b))^2\]

and the OLS method for estimating \(a,b\) is to find the parameters \(\hat{a},\hat{b}\) that minimize \(SS_{res}\). Notes: (i) in this case of simple (one independent variable) linear regression, there is an analytic solution for the estimates: the slope \(\hat{a} = r_{xy}\times\frac{\text{SD}_y}{\text{SD}_x}\) and the intercept \(\hat{b} = \bar{y} - \hat{a}\bar{x}\), where \(r_{xy}\) is the sample correlation coefficient between the \(x_i\) and \(y_i\); (ii) in more complex situations of least-squares curve-fitting, the minimization is performed numerically, for example via gradient descent: if you’re interested, check it out.
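
As a sanity check on these formulas, a minimal sketch comparing the analytic OLS solution to the output of ‘lm’, using the spindle data subset ‘df1’ from above:

d <- na.omit( df1[ , c("Droplet_Diameter_um" , "Spindle_Length_um")] ) #drop any missing values
a_hat <- cor( d$Droplet_Diameter_um , d$Spindle_Length_um ) *
  sd( d$Spindle_Length_um ) / sd( d$Droplet_Diameter_um ) #analytic slope: r * SD_y / SD_x
b_hat <- mean( d$Spindle_Length_um ) - a_hat * mean( d$Droplet_Diameter_um ) #analytic intercept
c( intercept = b_hat , slope = a_hat ) #compare with coef(lin_fit) from above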

Key question: can we get error bars on these estimates? The answer is yes: wait and see how. R will readily plot the OLS estimate for you, along with \(95\%\) confidence intervals.

How good is your linear fit?

There are many ways to think about this:

  1. The coefficient of determination \(R^2\) tells you how much of the variance in the dependent variable is explained by the model;
  2. In simple linear regression, \(R^2 = r^2\), where \(r\) is the correlation coefficient;
  3. There are a number of statistical tests that you could use, for example, the \(\chi^2\) (chi-squared) goodness-of-fit test – we shall not cover these here but Rice & Bevington are good references.

The more urgent questions we need to answer are ‘Why and when does this OLS method work?’ and ‘What assumptions underlie this method?’ People (and software!) tend to apply this method blindly and to not realize that their approach may violate any number of its underlying assumptions, rendering it invalid. So when is it appropriate to use the ordinary least squares method and why?

3.2.1 The assumptions underlying OLS: linear regression is nothing more than an example of MLE!

Make the assumption that each \(y_i = ax_i + b + \varepsilon_i\) where \(a\) and \(b\) are real numbers and the \(\varepsilon_i\) are drawn independently from \(N(0,\sigma^2)\), the Gaussian distribution with mean \(0\) and variance \(\sigma^2\):

\[P(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}}\text{exp}\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).\]

Then the OLS estimates \(\hat{a},\hat{b}\) are none other than the maximum likelihood estimates for the parameters \(a,b\) (see Appendix for a proof of this).

ESSENTIAL: In performing OLS, you have assumed:

  • Your residuals are Gaussian with a mean of \(0\);
  • They are independent of one another (i.e., uncorrelated);
  • Neither the mean nor the standard deviation is a function of \(x\).

These points require slightly more explication: if the mean of the residuals in any thin vertical strip is non-zero, we call the residuals biased (otherwise, unbiased); if the variance of the residuals differs as a function of \(x\), we call the residuals heteroscedastic (otherwise, homoscedastic):

Figure demonstrating residual bias and heteroscedasticity

If assumptions are violated, not all is lost!! You just need to be a bit cleverer in how you choose and/or fit your model: it could be a problem with the signal (choose a different model) OR it could be a problem with the noise (choose a different noise model, i.e., not OLS).

Note for the eager: you can usually see by eye whether or not your residuals are heteroscedastic. To be more rigorous about this, if there is a lot of data, you can i) divide the x-axis of the residuals plot into bins, ii) compute the variance of the data in each bin & iii) see how much this variance itself varies.
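
A minimal sketch of this binning idea, applied to the residuals of the spindle-size fit ‘lin_fit’ from above (the choice of 10 bins is arbitrary):

res_df <- data.frame( fitted = fitted(lin_fit) , resid = resid(lin_fit) ) #fitted values & residuals
res_df$bin <- cut( res_df$fitted , breaks = 10 ) #i) divide the x-axis into 10 bins
tapply( res_df$resid , res_df$bin , var ) #ii) & iii) variance of the residuals within each bin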

3.2.2 Anscombe’s Quartet demonstrates violations of OLS assumptions

Anscombe’s Quartet consists of four data-sets, each consisting of 11 data points. Each data set has the same \(\bar{x}, \bar{y}, \text{SD}_x , \text{SD}_y\), correlation coefficient and linear regression line (OLS):
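
R ships with Anscombe’s Quartet as the built-in dataset ‘anscombe’; a minimal sketch to verify the claim for the first two data-sets (extend it to all four if you like):

coef( lm( y1 ~ x1 , data = anscombe ) ) #OLS fit to the 1st data-set
coef( lm( y2 ~ x2 , data = anscombe ) ) #OLS fit to the 2nd data-set -- (essentially) the same line
cor( anscombe$x1 , anscombe$y1 ) #correlation in the 1st data-set
cor( anscombe$x2 , anscombe$y2 ) #(essentially) the same correlation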

What would the residuals look like here?? Check ’em out:

Computing & plotting residuals in R:

Here we plot the residuals for the linear fit of Good et al.’s spindle size data:

lin_fit <- lm( Spindle_Length_um ~ Droplet_Diameter_um , df1) #perform fit
pres <- ggplot( lin_fit , aes( .fitted , .resid) ) #initiate residuals plotting structure
pres + geom_point() #plot it

Exercise 5 (~5 minutes): Discuss the result.

3.2.3 Obtaining confidence intervals on your estimates using bootstrapping

The easiest way to retrieve confidence intervals on your parameter estimates is by bootstrapping (all this means is ‘sampling with replacement’, as we shall see).

The algorithm is as follows:

  1. Perform the fit to get parameter estimates for the model \(y=\hat{f}(x)\);
  2. Bootstrap the residuals (i.e., sample with replacement) and add the sampled residuals back to the model to get another data-set \(D'\) – that is, for every original data point \(\{x_i , y_i \}\), randomly select a residual \(\varepsilon_j\) and generate a new data point \(\{x_i , y_i' \}\), where \(y_i'=\hat{f}(x_i)+\varepsilon_j\) – do this for every original data point to retrieve a new, synthesized (bootstrapped) data set \(D' = \{x_i , y_i' \}_{i\in I}\);
  3. Perform another fit on the new bootstrapped data to retrieve new parameter estimates \(\hat{a'},\hat{b'}\).
  4. Perform steps 2-3 above \(n\) (commonly \(=1000\)) times. Then we have \(n\) estimates for \(a,b\), giving us a distribution, and we can look at the statistics of these distributions, such as standard deviation, variance and confidence intervals.

This is all very straightforward to implement in R (for more, see here). We provide below a biological example of bootstrapping (Good et al.’s linear fit); note that the ‘boot’ code below resamples the data points themselves rather than the residuals (a closely related variant of the bootstrap):

library(boot)
# function to obtain regression weights 
bs <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(coef(fit)) 
} 
# bootstrapping with 1000 replications 
results <- boot(data=df1, statistic=bs, 
    R=1000, formula=Spindle_Length_um ~ Droplet_Diameter_um )

# view results
results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df1, statistic = bs, R = 1000, formula = Spindle_Length_um ~ 
##     Droplet_Diameter_um)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 21.0231895  0.0112965374  0.62453364
## t2*  0.2040499 -0.0002943708  0.01144864
plot(results, index=1) # intercept 
plot(results, index=2) # slope
# get 95% confidence intervals 
boot.ci(results, type="bca", index=1) # intercept 
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 1)
## 
## Intervals : 
## Level       BCa          
## 95%   (19.72, 22.19 )  
## Calculations and Intervals on Original Scale
boot.ci(results, type="bca", index=2) # slope
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = results, type = "bca", index = 2)
## 
## Intervals : 
## Level       BCa          
## 95%   ( 0.1821,  0.2271 )  
## Calculations and Intervals on Original Scale

For discussion: when you used ‘lm’ before to perform linear regression, the summary output included standard errors on the estimates. How do the confidence intervals from the bootstrap compare with those implied by ‘lm’ (this will be a good sanity check)?
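
A minimal sketch for this sanity check, converting the ‘lm’ fit into 95% confidence intervals with ‘confint’:

confint( lin_fit , level = 0.95 ) #95% confidence intervals on the 'lm' estimates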

4 Nonlinear least squares

4.1 Example: modeling a power law

Power laws \(y = ax^n\) occur throughout biology: see (Kleiber, 1932; Gillooly et al., 2001; White et al., 2003) for examples; they are also the primary mode of thinking about scaling in biology, which is currently very sexy! You can ask questions such as ‘how does spindle size scale with cell volume?’ and ‘how does the energetic cost of the cell cycle scale with cell volume?’, for example. Here, for pedagogical purposes, we synthesize an example in which \(a=1\) and \(n=0.55\), with added noise:

set.seed(100) #set your seed
n <- 0.55 #define the exponent
x <- seq( 0 , 1000 , by = 2) #independent variable x
err <- rnorm( length(x) , mean = 0 , sd = 1) #normally distributed error
y <- x**n + err #dependent variable y (power-law + noise)
df1 <- as.data.frame( cbind( x , y )) #build relevant dataframe
p <- ggplot( df1 , aes( x , y )) #initiate plotting structure
p + geom_point() #plot it!

If you have two variables related by a power law and plot the data on log-log axes, the result is linear:

p + geom_point() + scale_x_log10() + scale_y_log10() #plot and log both axes

This is due to our basic log laws: (i) \(\text{log}(ab) = \text{log}(a) + \text{log}(b)\) & (ii) \(\text{log}(a^b) = b\text{log}(a)\). It follows from these that if two variables are related by a power law \(y=ax^n\), then their logs are related linearly: \[\text{log}(y) = \text{log}(a) + n\text{log}(x).\]

Now, in the literature, a lot of people fit power laws by transforming to log axes and then performing a linear fit.
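
A minimal sketch of that log-log approach applied to our synthesized data ‘df1’ (keeping only strictly positive values so that the logs are defined):

df_pos <- subset( df1 , x > 0 & y > 0 ) #logs are only defined for positive values
log_fit <- lm( log10(y) ~ log10(x) , data = df_pos ) #linear fit on log-log axes
summary(log_fit) #the slope is an estimate of the exponent n
plot( fitted(log_fit) , resid(log_fit) , xlab = "fitted" , ylab = "residual" ) #residuals of the log-log fit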

Looking at the residuals of such a log-log linear fit, we see that, in this case, the model isn’t great, as the residuals are heteroscedastic. Instead of fitting a line to the log-transformed data, we should try to fit a nonlinear curve (power law) to the non-transformed data:

To do so, we use a nonlinear least-squares curve-fitting method (we follow this example: http://www.walkingrandomly.com/?p=5254):

# a starting value
n1 <- 10
#plot the fit
ggplot(data=df1, aes( x , y)) + geom_point() + 
    geom_smooth(method="nls", formula = y ~ x**n, start=list(n = n1) , se=F ,control = list(maxiter = 500),
                size = 1) 

Note that we are now performing a nonlinear least-squares curve fit (‘nls’). Once again, to get the actual parameters in the fit, we perform the fit in a slightly different manner:

# do the fit
fit = nls(y ~ x**n, start=list(n = n1) , control = list(maxiter = 500))
# summarise
summary(fit)
## 
## Formula: y ~ x^n
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## n 0.5497097  0.0002263    2430   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.005 on 500 degrees of freedom
## 
## Number of iterations to convergence: 68 
## Achieved convergence tolerance: 7.534e-08

Exercise 6 (~10 minutes):

  1. Using ‘nls’, fit a power law to the spindle size data introduced above – report the exponent in the fit, along with error bars;
  2. Use the AIC to compare the linear model with the power law model. Note: in this case, even if the AIC reports the power law model to be better, we need to think hard about which we accept because the linear model is a mechanistic/physical model. See (Good et al., 2013) for further details.

Template for solution:

df <- read.csv("data/spindle_size_MG.csv") #load data 
df1 <- subset( df , Droplet_Diameter_um <80) #subset data to relevant range
##define variables of interest from your dataframe:
y1 <- df1$...
x1 <- df1$...
#perform power law fit:
fitg <- nls(...) #<- fill in arguments to 'nls'
summary(fitg)
##perform linear fit:
lin_fit <- lm( Spindle_Length_um ~ Droplet_Diameter_um , df1)
##compute AIC of both fits:
##WRITE CODE HERE
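
For reference, one possible way to fill in the template (a sketch only: it uses the same one-parameter power law \(y=x^n\) as in the synthetic example above, and the starting value for the exponent is an arbitrary choice; you might also consider a two-parameter power law \(y=ax^n\)):

y1 <- df1$Spindle_Length_um #dependent variable: spindle length
x1 <- df1$Droplet_Diameter_um #independent variable: droplet diameter
fitg <- nls( y1 ~ x1**n , start = list( n = 1 ) , control = list(maxiter = 500)) #power-law fit
summary(fitg) #exponent estimate with standard error
lin_fit <- lm( Spindle_Length_um ~ Droplet_Diameter_um , df1) #linear fit
AIC( fitg , lin_fit ) #compare the two fits: the smaller AIC is preferred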

5 References