
Linear Model for Differential Abundance Analysis with Sample Weighting (Experimental)
Source:R/linda2.R
linda2.RdEXPERIMENTAL: This function extends linda() with sample weighting support. It implements a simple, robust, and highly scalable approach to tackle the compositional effects in differential abundance analysis. It fits linear regression models on the centered log2-ratio transformed data, identifies a bias term due to the transformation and compositional effect, and corrects the bias using the mode of the regression coefficients. It could fit mixed-effect models. Note that linda is developed separately from other MicrobiomeStat functions, so its usage is different.
Usage
linda2(
feature.dat,
meta.dat,
phyloseq.obj = NULL,
formula,
feature.dat.type = c("count", "proportion", "other"),
prev.filter = 0,
mean.abund.filter = 0,
max.abund.filter = 0,
is.winsor = TRUE,
outlier.pct = 0.03,
adaptive = TRUE,
zero.handling = c("pseudo-count", "imputation"),
pseudo.cnt = 0.5,
corr.cut = 0.1,
p.adj.method = "BH",
alpha = 0.05,
n.cores = 1,
verbose = TRUE,
weights = NULL,
tree = NULL,
tree.smooth = FALSE,
tree.lambda = 0.1,
tree.k = 5,
omnibus = TRUE
)Arguments
- feature.dat
A data frame or matrix representing observed feature table (e.g., OTU/ASV table). Rows represent taxa; columns represent samples. NAs are not expected in feature tables so are not allowed in function linda.
- meta.dat
A data frame of covariates. The rows of meta.dat correspond to the columns of feature.dat. NAs are allowed. If there are NAs, the corresponding samples will be removed in the analysis.
- phyloseq.obj
A phyloseq object (optional). If provided, the feature.dat and meta.dat will be extracted from this object.
- formula
Character. For example: formula = '~x1*x2+x3+(1|id)'. At least one fixed effect is required.
- feature.dat.type
Character. Specifies the type of the data in feature.dat. Options are "count" (default), "proportion" or "other". If "count", the data will be treated as count data and undergo zero-handling. If "proportion", the data will be treated as compositional data and undergo half minimum imputation for zeros. If "other", all filters (max.abund.filter, mean.abund.filter, and prev.filter) will be reset to 0.
- prev.filter
A real value between 0 and 1; taxa with prevalence (percentage of nonzeros) less than prev.filter are excluded. Default is 0 (no taxa will be excluded).
- mean.abund.filter
A real value; taxa with mean abundance less than mean.abund.filter are excluded. Default is 0 (no taxa will be excluded).
- max.abund.filter
A real value; taxa with max abundance less than max.abund.filter are excluded. Default is 0 (no taxa will be excluded).
- is.winsor
Boolean. If TRUE (default), the Winsorization process will be conducted for the feature table.
- outlier.pct
A real value between 0 and 1; Winsorization cutoff (percentile) for the feature table, e.g., 0.03. Default is NULL. If NULL, Winsorization process will not be conducted.
- adaptive
Boolean. Default is TRUE. If TRUE, the parameter imputation will be treated as FALSE no matter what it is actually set to be. Then the significant correlations between the sequencing depth and explanatory variables will be tested via the linear regression between the log of the sequencing depths and formula. If any p-value is smaller than or equal to corr.cut, the imputation approach will be used; otherwise, the pseudo-count approach will be used.
- zero.handling
Character. Specifies the method to handle zeros in the feature table. Options are "pseudo-count" or "imputation" (default is "pseudo-count"). If "imputation", zeros in the feature table will be imputed using the formula in the referenced paper. If "pseudo-count", a small constant (pseudo.cnt) will be added to each value in the feature table.
- pseudo.cnt
A positive real value. Default is 0.5. If zero.handling is set to "pseudo-count", this constant will be added to each value in the feature table.
- corr.cut
A real value between 0 and 1; significance level of correlations between the sequencing depth and explanatory variables. Default is 0.1.
- p.adj.method
Character; p-value adjusting approach. See R function p.adjust. Default is 'BH'.
- alpha
A real value between 0 and 1; significance level of differential abundance. Default is 0.05.
- n.cores
A positive integer. If n.cores > 1 and formula is in a form of mixed-effect model, n.cores parallels will be conducted. Default is 1.
- verbose
A boolean; if TRUE, progress messages will be printed. Default is TRUE.
- weights
Optional. Sample weights for the regression models. Can be:
NULL (default): No weighting, standard LinDA analysis
"detection_depth": Automatic Detection Depth Weighting (DDW) using
sqrt(colSums(feature.dat > 0)). This is the recommended approach for handling sample quality heterogeneity.Character: Name of a column in
meta.datcontaining sample weightsNumeric vector: Sample weights (length must equal
ncol(feature.dat))
Technical Note on DDW: We use
sqrt(detection_depth)rather than raw detection depth because:Raw detection depth weights cause standard error underestimation (~7.5%)
This leads to inflated Type I error (7.7% vs target 5%)
Square root transformation provides better FDR control while preserving power gains
Important Notes:
Weighting affects both coefficient estimation and standard errors
Weights are normalized to sum to the number of samples
DDW is most beneficial when samples have heterogeneous detection depths (CV > 0.2)
- tree
A phylo object (from ape package) representing the phylogenetic tree. If provided and
tree.smooth = TRUE, tree-guided smoothing will be applied to boost statistical power. Default is NULL (no tree smoothing).- tree.smooth
Logical; if TRUE and a tree is provided, apply phylogenetic smoothing to the test statistics. This borrows strength from evolutionarily related taxa to improve power. Default is FALSE.
- tree.lambda
Numeric; smoothing strength parameter for tree-guided smoothing. Larger values = more smoothing. Recommended range: 0.05-0.2. Default is 0.1.
- tree.k
Integer; number of nearest neighbors for local smoothing. Using fewer neighbors prevents signal leakage. Default is 5.
- omnibus
Logical; if TRUE and tree smoothing is enabled, compute Max-T omnibus P-values that combine baseline and smoothed results. The omnibus achieves the "envelope" effect: it captures the advantage of smoothing for clustered signals while not losing power for random signals. Default is TRUE when tree.smooth is enabled.
Value
A list with the elements
- variables
A vector of variable names of all fixed effects in
formula. For example:formula = '~x1*x2+x3+(1|id)'. Supposex1andx2are numerical, andx3is a categorical variable of three levels: a, b and c. Then the elements ofvariableswould be('x1', 'x2', 'x3b', 'x3c', 'x1:x2').- bias
numeric vector; each element corresponds to one variable in
variables; the estimated bias of the regression coefficients due to the compositional effect.- output
a list of data frames with columns 'baseMean', 'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj', 'reject', 'df';
names(output)is equal tovariables; the rows of the data frame corresponds to taxa. Note: if there are taxa being excluded due toprev.cut, the number of the rows of the output data frame will be not equal to the number of the rows offeature.dat. Taxa are identified by the rownames. If the rownames offeature.datare NULL, then1 : nrow(feature.dat)is set as the rownames offeature.dat.baseMean: 2 to the power of the intercept coefficients (normalized by one million)
log2FoldChange: bias-corrected coefficients
lfcSE: standard errors of the coefficients
stat: log2FoldChange / lfcSE
pvalue: 2 * pt(-abs(stat), df)
padj: p.adjust(pvalue, method = p.adj.method)
reject: padj <= alpha
df: degrees of freedom. The number of samples minus the number of explanatory variables (intercept included) for fixed-effect models; estimates from R package
lmerTestwith Satterthwaite method of approximation for mixed-effect models.
- feature.dat.use
the feature table used in the abundance analysis (the
feature.datafter the preprocessing: samples that have NAs in the variables informulaor have less thanlib.cutread counts are removed; taxa with prevalence less thanprev.cutare removed and data is winsorized if!is.null(winsor.quan); and zeros are treated, i.e., imputed or pseudo-count added).- meta.dat.use
the meta data used in the abundance analysis (only variables in
formulaare stored; samples that have NAs or have less thanlib.cutread counts are removed; numerical variables are scaled).
References
Huijuan Zhou, Kejun He, Jun Chen, and Xianyang Zhang. LinDA: Linear Models for Differential Abundance Analysis of Microbiome Compositional Data.
Author
Huijuan Zhou [email protected] Jun Chen [email protected] Maintainer: Huijuan Zhou
Examples
if (FALSE) { # \dontrun{
library(ggrepel)
data(smokers)
ind <- smokers$meta$AIRWAYSITE == "Throat"
otu.tab <- as.data.frame(smokers$otu[, ind])
meta <- cbind.data.frame(
Smoke = factor(smokers$meta$SMOKER[ind]),
Sex = factor(smokers$meta$SEX[ind]),
Site = factor(smokers$meta$SIDEOFBODY[ind]),
SubjectID = factor(smokers$meta$HOST_SUBJECT_ID[ind])
)
ind1 <- which(meta$Site == "Left")
res.left <- linda(otu.tab[, ind1], meta[ind1, ],
formula = "~Smoke+Sex", alpha = 0.1,
prev.filter = 0.1
)
ind2 <- which(meta$Site == "Right")
res.right <- linda(otu.tab[, ind2], meta[ind2, ],
formula = "~Smoke+Sex", alpha = 0.1,
prev.filter = 0.1
)
rownames(res.left$output[[1]])[which(res.left$output[[1]]$reject)]
rownames(res.right$output[[1]])[which(res.right$output[[1]]$reject)]
linda.obj <- linda(otu.tab, meta,
formula = "~Smoke+Sex+(1|SubjectID)", alpha = 0.1,
prev.filter = 0.1
)
} # }