Negative binomial models

You might notice that we use the negative binomial (NB) model quite a lot in BIRDMAn. Here we will briefly explain why we use this distribution for modelling microbiome data. Do note, however, that BIRDMAn is flexible to custom models implementing other distributions such as Dirichlet-multinomial, Poisson-lognormal, etc.

Count models

When we are working with microbiome sequencing (or really most other ‘omics data types), our results are typically counts. As such, we can’t use simple linear regression (which includes fractions and negative numbers) to model the number of each microbe in each sample. We instead want to use a statistical distribution that is explicitly defined on the set of whole numbers. The simplest of such models is the Poisson distribution.

The Poisson distribution has a single parameter, \(\lambda\) that defines the rate of “events”. In this distribution, the mean and variance are the same, with both being equal to \(\lambda\). This poses a problem for microbiome data, since the variance is typically much greater than the mean. For more on this, see McMurdie & Holmes 2014.

Negative binomial model

Enter the negative binomial model. The negative binomial distribution can be thought of as a Poisson with an allowance for extra variance. This is useful for microbiome data, as we can account for the count nature of the data while accomodating inflated variance.

We use the following parameterization from Stan where \(i\) represents a sample and \(j\) represents a microbe.

\[y_{ij} = \textrm{NegativeBinomial}(\mu_{ij}, \phi_j)\]

In this model, \(mu_{ij}\) represents the mean count of this microbe in this sample and \(\phi_j\) represents the dispersion parameter for this microbe. This model is useful because we can model one or both parameters hierarchically according to a generalized linear model with log-link function.

For example, if we expect that the mean abundance depends on whether the sample is a case or control, we can add a parameter and fit it using BIRDMAn.

\[\ln(\mu_{ij}) = \beta_{0j} + \beta_{1j} x_i + \ln(\textrm{Depth}_i)\]

In this equation, \(\beta_0\) is the intercept term. This term is the average (log) control sample proportion. For more on this see this blogpost.

Here, \(x_i\) is a binary vector representing whether sample \(i\) is a case sample (1) or a control sample (0). Thus, \(\beta_1\) represents the log-fold change of microbial abundance between cases and controls.

Finally, we account for the fact that sampling depth differs across samples by adding a correction term of the log depth of the modeled sample.

Note

This is a pretty basic equation but we can modify it depending on our experimental design or specific questions. For example we could add more covariates, subject effects, or even model the dispersion as a sample dependent parameter!

Output

When we fit this model using BIRDMAn, we get distributions of plausible parameter values given our data and priors. For a single microbe, we typically are interested in \(\beta_0\), \(\beta_1\), and \(\phi\). When we fit all of the microbes, we end up with \(D\) distributions for each of these parameter where D is the total number of microbes in the table.

You can imagine that if we, for example, modeled dispersion as a function of both sample and microbe, we would get \(N \times D\) distributions where \(N\) is the number of samples in our table.