Parallelization in BIRDMAn through the Single Feature Model¶
There are two ways you can implement a model in BIRDMAn: as a Table Model or as a Single Feature Model.
In the Table Model method, we initialize one model that fits the entirety of our data. For an example of implementation with the TableModel class, see our documentation on implementing a custom model.
In the Single Feature method, we break down the fitting so that each microbe gets its own model via the SingleFeatureModel class. Because each microbe has its own model fit, the Single Feature approach allows for computational parallelization: instead of running the model for each microbe serially as in the Table Model method, we can fit models for multiple microbes simultaneously, speeding up your BIRDMAn workflow.
To walk through the Single Feature method, we will use the same data as in our implementing a custom model example. Check out the custom model tutorial for background details on downloading the study data, filtering the metadata, and filtering the feature table. Or, download and save our filtered metadata metadata_model.csv and feature table filt_tbl.biom.
Stan code¶
We will save the file below as single_negative_binomial_re.stan so we can import and compile it in BIRDMAn. The model code remains similar to the Table Model Stan code negative_binomial_re.stan in the custom model tutorial. The key difference is that the Single Feature approach no longer requires the feature dimension that the Table Model did, since each model fits only a single microbe.
data {
  int<lower=1> N;                                   // number of sample IDs
  int<lower=1> num_subjs;                           // number of groups (subjects)
  int<lower=1> p;                                   // number of covariates
  real A;                                           // mean of intercept prior
  vector[N] log_depths;                             // log sequencing depths of samples
  matrix[N, p] x;                                   // covariate matrix
  array[N] int<lower=0, upper=num_subjs> subj_map;  // mapping of samples to subject IDs
  array[N] int y;                                   // observed microbe abundances
  real<lower=0> B_p;                                // stdev for covariate beta normal prior
  real<lower=0> inv_disp_sd;                        // stdev for inverse dispersion lognormal prior
  real<lower=0> re_p;                               // stdev for subject intercept normal prior
}
parameters {
  real<offset=A, multiplier=2> beta_0;  // intercept parameter
  vector[p-1] beta_x;                   // parameters for covariates
  real<lower=0> inv_disp;               // inverse dispersion parameter
  vector[num_subjs] subj_re;            // subject intercepts
}
transformed parameters {
  vector[p] beta_var = append_row(beta_0, beta_x);
  vector[N] lam = x*beta_var + log_depths;
  for (n in 1:N) {
    // add subject intercepts
    lam[n] += subj_re[subj_map[n]];
  }
}
model {
  // Specify priors
  beta_0 ~ normal(A, 2);
  for (i in 1:p-1) {
    beta_x[i] ~ normal(0, B_p);
  }
  subj_re ~ normal(0, re_p);
  inv_disp ~ lognormal(0, inv_disp_sd);
  // Fit model
  y ~ neg_binomial_2_log(lam, inv(inv_disp));
}
generated quantities {
  vector[N] y_predict;  // posterior predictive samples
  vector[N] log_lhood;  // log-likelihood of observations given the posterior
  for (n in 1:N) {
    y_predict[n] = neg_binomial_2_log_rng(lam[n], inv(inv_disp));
    log_lhood[n] = neg_binomial_2_log_lpmf(y[n] | lam[n], inv(inv_disp));
  }
}
Getting started¶
Import the following libraries. Make sure you have birdman installed so that you can import the SingleFeatureModel and ModelIterator classes. Then set the paths to your metadata, BIOM table, and model code.
import os
from tempfile import TemporaryDirectory

import biom
import click
import numpy as np
import pandas as pd
from birdman import SingleFeatureModel, ModelIterator

PROJ_DIR = "/path/to/proj/dir"
MD = pd.read_table(f"{PROJ_DIR}/path/to/metadata_model.csv", sep=",", index_col=0)
TABLE_FILE = f"{PROJ_DIR}/path/to/filt_tbl.biom"
MODEL_PATH = f"{PROJ_DIR}/path/to/single_negative_binomial_re.stan"

TABLE = biom.load_table(TABLE_FILE)
FIDS = TABLE.ids(axis="observation")
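As a quick sanity check, you can confirm how many single-feature models you will eventually be fitting:
print(f"{len(FIDS)} features across {TABLE.shape[1]} samples")  # biom tables are (features x samples)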
Define SingleFeatureModel class¶
We will now pass this file, along with our table, metadata, and formula, into BIRDMAn. Note that we are using the base SingleFeatureModel class for our model: we inherit from this class to build our own custom model.
class HelminthModelSingle(SingleFeatureModel):
    def __init__(
        self,
        table: biom.Table,
        feature_id: str,
        beta_prior: float = 2.0,
        inv_disp_sd: float = 0.5,
        subj_prior: float = 2.0,
        chains: int = 4,
        num_iter: int = 500,
        num_warmup: int = 1000,
    ):
        super().__init__(
            table=table,
            feature_id=feature_id,
            model_path=MODEL_PATH,
            chains=chains,
            num_iter=num_iter,
            num_warmup=num_warmup,
        )
        # Create a mapping of sample to subject
        # Start at 1 because Stan 1-indexes
        subj_series = MD["host_subject_id"].loc[self.sample_names]
        samp_subj_map = subj_series.astype("category").cat.codes + 1
        self.subjects = np.sort(subj_series.unique())

        # Create the design matrix
        formula = "C(time_point, Treatment('post-deworm'))"
        self.create_regression(formula, MD)

        # Assume intercept prior is log-average proportion based on total number of features
        D = table.shape[0]
        A = np.log(1 / D)

        param_dict = {
            "log_depths": np.log(TABLE.sum(axis="sample")),
            "A": A,
            "num_subjs": len(self.subjects),
            "subj_map": samp_subj_map.values,
            "B_p": beta_prior,
            "inv_disp_sd": inv_disp_sd,
            "re_p": subj_prior
        }
        self.add_parameters(param_dict)

        # Specify the parameters you want to keep and dimensions
        self.specify_model(
            params=["beta_var", "inv_disp", "subj_re"],
            dims={
                "beta_var": ["covariate"],
                "subj_re": ["subject"],
                "log_lhood": ["tbl_sample"],
                "y_predict": ["tbl_sample"]
            },
            coords={
                "covariate": self.colnames,
                "tbl_sample": self.sample_names,
                "subject": self.subjects
            },
            include_observed_data=True,
            posterior_predictive="y_predict",
            log_likelihood="log_lhood"
        )
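Before parallelizing, it can be useful to sanity-check the class by compiling and fitting the model for one feature. Below is a minimal sketch using the first feature ID from the table; it mirrors the per-feature fitting loop we will write later.
# Fit one feature end-to-end to verify the model compiles and samples
single_model = HelminthModelSingle(table=TABLE, feature_id=FIDS[0])
single_model.compile_model()
single_model.fit_model()
inf = single_model.to_inference()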
Chunking¶
Now that we have created our model class, we want to write a script to assist us with parallelization. BIRDMAn provides a convenience class, ModelIterator, to simplify fitting multiple features at once. This class allows us to "chunk" our feature table so that we can run subsets of the table on different cores.
When you pass in the number of chunks, BIRDMAn automatically determines how many features go in each chunk and creates a list of lists of tuples. For example, if we have a table of 8 features that we want to split into three chunks, ModelIterator would subset the table as follows:
[
[(feature_1, model_1), (feature_2, model_2), (feature_3, model_3)],
[(feature_4, model_4), (feature_5, model_5), (feature_6, model_6)],
[(feature_7, model_7), (feature_8, model_8)]
]
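As a minimal sketch (using the table and model class defined above), chunk indexing looks like this:
# Split the table into 3 chunks of single-feature models
model_iter = ModelIterator(TABLE, HelminthModelSingle, num_chunks=3)

# Each chunk is a list of (feature_id, model) tuples
first_chunk = model_iter[0]
for feature_id, model in first_chunk:
    print(feature_id)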
What this means for us is that we can write a script that takes in the total number of chunks and the current chunk number and fits all the features in that chunk. With an HPC job scheduler such as SLURM or Torque, we can use multiple cores to significantly reduce the time needed to fit all features! We will call this Python script run_birdman_chunked.py. We will use the click package to create a simple CLI.
@click.command()
@click.option("--inference-dir", required=True)
@click.option("--num-chunks", type=int, required=True)
@click.option("--chunk-num", type=int, required=True)
@click.option("--chains", default=4)
@click.option("--num-iter", default=500)
@click.option("--num-warmup", default=1000)
@click.option("--beta-prior", default=2.0)
@click.option("--inv-disp-sd", default=0.5)
@click.option("--re-prior", default=2.0)
def run_birdman(
    inference_dir,
    num_chunks,
    chunk_num,
    chains,
    num_iter,
    num_warmup,
    beta_prior,
    inv_disp_sd,
    re_prior,
):
    model_iter = ModelIterator(
        TABLE,
        HelminthModelSingle,
        num_chunks=num_chunks,
        beta_prior=beta_prior,
        inv_disp_sd=inv_disp_sd,
        subj_prior=re_prior,
        chains=chains,
        num_iter=num_iter,
        num_warmup=num_warmup
    )
    # Get chunk number - SLURM array IDs start at 1, so subtract 1 for 0-based indexing
    chunk = model_iter[chunk_num - 1]

    for feature_id, model in chunk:
        # Specify a temporary directory for intermediate files during model fitting
        tmpdir = f"{inference_dir}/tmp/{feature_id}"
        os.makedirs(tmpdir, exist_ok=True)

        # Specify the output file
        outfile = f"{inference_dir}/{feature_id}.nc"

        with TemporaryDirectory(dir=tmpdir) as t:
            model.compile_model()
            model.fit_model(sampler_args={"output_dir": t})
            inf = model.to_inference()
            inf.to_netcdf(outfile)

if __name__ == "__main__":
    run_birdman()
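Before submitting to the cluster, you can test the script locally on a single chunk. For example, the following invocation (with illustrative paths) fits only the features in the first of 20 chunks:
python run_birdman_chunked.py \
    --inference-dir ./inferences \
    --num-chunks 20 \
    --chunk-num 1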
Running our parallel script¶
With our Python script in hand, we can now write a short batch script that tells our cluster how to chunk our workflow and allocate resources. For this tutorial we will be using SLURM, but there should be equivalent procedures in other schedulers. The --array=1-20 line indicates that we want to split our table into 20 chunks.
Note
You should compile your custom Stan file before you run this script. Otherwise, each array task will try to compile the model at the same time and run into issues. You can do this easily with cmdstanpy.CmdStanModel(stan_file="single_negative_binomial_re.stan").
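For example, run this once from Python before submitting the array job:
import cmdstanpy

# One-time compilation; each array task will then reuse the cached executable
cmdstanpy.CmdStanModel(stan_file="single_negative_binomial_re.stan")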
#!/bin/bash
#SBATCH --mem=8G
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --cpus-per-task=4
#SBATCH --time=6:00:00
#SBATCH --array=1-20

echo "Chunk $SLURM_ARRAY_TASK_ID / $SLURM_ARRAY_TASK_MAX"

OUTDIR="./inferences"
mkdir -p $OUTDIR

echo "Starting Python script..."
python run_birdman_chunked.py \
    --inference-dir $OUTDIR \
    --num-chunks $SLURM_ARRAY_TASK_MAX \
    --chunk-num $SLURM_ARRAY_TASK_ID \
    --chains 4 \
    --num-iter 500 \
    --num-warmup 1000 \
    --beta-prior 2.0 \
    --inv-disp-sd 0.5 \
    --re-prior 2.0
If you submit this script with sbatch, your scheduler should start up 20 jobs, each one running a different chunk of features!
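Once the jobs finish, each feature's posterior is saved as a NetCDF file in the inferences directory, which you can load as an arviz InferenceData object for downstream analysis (the file name below is illustrative):
import arviz as az

# Load the saved posterior for one feature; substitute a real feature ID
inf = az.from_netcdf("./inferences/feature_1.nc")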