Generate Circular Cladogram with Heatmap for Taxonomic Data — generate_taxa_cladogram_single

This function generates a circular cladogram with an integrated heatmap for taxonomic data. It visualizes the phylogenetic relationships between different taxa and their abundances or other coefficients (e.g., from differential abundance testing) across different taxonomic levels using a tree-like structure (cladogram). The heatmap is overlaid on the branches/tips of the cladogram.

Usage

generate_taxa_cladogram_single(
  data.obj,
  test.list = NULL,
  group.var = NULL,
  feature.level,
  feature.mt.method = "none",
  cutoff = 1,
  color.group.level = NULL,
  palette = NULL,
  pdf = FALSE,
  pdf.width = 10,
  pdf.height = 10,
  time.var = NULL,
  t.level = NULL,
  adj.vars = NULL,
  prev.filter = 0.1,
  abund.filter = 1e-04,
  feature.dat.type = "count"
)

Arguments

data.obj: A list object in a format specific to MicrobiomeStat, which includes components like `feature.tab` (feature abundance table), `feature.ann` (feature annotation table, containing taxonomic assignments), `meta.dat` (metadata table), and optionally `tree` (a phylogenetic tree).
test.list: A nested list of test results. Each top-level element is named by taxonomic level (e.g., "Phylum", "Genus") and can hold one data frame per comparison (e.g., "groupA_vs_groupB"), each containing test results (e.g., coefficients, P-values) for the taxa at that level. If `NULL`, test results are generated internally using `generate_taxa_test_single` (requires `group.var`, `time.var`, etc., to be set appropriately).
group.var: The name of the grouping variable in `meta.dat` used for comparisons. This is essential if `test.list` is `NULL`, and also used for naming output files/plots if multiple comparisons exist.
feature.level: A character vector specifying the taxonomic levels to be included in the analysis and displayed on the heatmap. The order matters: from coarser (e.g., "Phylum") to finer (e.g., "Species"). The last element is considered the `finest_taxonomic_level` and will typically correspond to the tips of the cladogram if a taxonomic tree is built.
feature.mt.method: Character string specifying the multiple testing correction method to apply to P-values. Options include "fdr" (False Discovery Rate), "bonferroni", "holm", "hochberg", "hommel", "BH", "BY", or "none" (default, no correction). Used for filtering significant features if `test.list` is generated internally or for applying the `cutoff`.
cutoff: Numeric. The significance cutoff (e.g., P-value or adjusted P-value threshold). Taxa with P-values (or adjusted P-values, depending on `feature.mt.method`) greater than this cutoff will have their coefficients set to 0 in the heatmap, effectively marking them as non-significant. Default is 1 (no filtering by significance).
color.group.level: Character string. The taxonomic level used to color-code the branches and tip labels of the cladogram. This level must be one of the levels present in `feature.level`. If `NULL` or not specified, it defaults to the first (coarsest) level in `feature.level`.
palette: An optional character vector of hex color codes to be used for coloring the groups defined by `color.group.level`. If `NULL`, a default color palette will be used.
pdf: Logical. If `TRUE`, the generated plot(s) will be saved as PDF files. File names will be based on the comparison groups. Default is `FALSE`.
pdf.width: Numeric. The width of the PDF file in inches, if `pdf = TRUE`. Default is 10.
pdf.height: Numeric. The height of the PDF file in inches, if `pdf = TRUE`. Default is 10.
time.var: Character string specifying the column name in `meta.dat` that contains the time variable. Used only if `test.list` is `NULL` and `generate_taxa_test_single` is called.
t.level: Character string specifying a particular time point or level of the `time.var` to subset the data for analysis. Used only if `test.list` is `NULL`. If `NULL`, data across all time points is used.
adj.vars: A character vector specifying column names in `meta.dat` to be used as adjustment variables (covariates) in the statistical model. Used only if `test.list` is `NULL`.
prev.filter: Numeric. Minimum prevalence threshold for taxa inclusion. Default is 0.1.
abund.filter: Numeric. Minimum abundance threshold for taxa inclusion. Default is 0.0001.
feature.dat.type: Character. Type of data in `feature.tab` ("count", "proportion", or "other"). Default is "count".
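
The nested shape described for `test.list` can be sketched with base R. This is only an illustration of one plausible layout: the `Variable` and `Coefficient` column names and all values are assumptions, and the real output of `generate_taxa_test_single` may contain different or additional columns.

```r
# Hypothetical results for one comparison at the Phylum level.
# Column names other than P.Value and Adjusted.P.Value are assumed.
phylum_res <- data.frame(
  Variable = c("Firmicutes", "Bacteroidetes"),
  Coefficient = c(0.8, -0.3),
  P.Value = c(0.01, 0.40),
  Adjusted.P.Value = c(0.02, 0.40),
  stringsAsFactors = FALSE
)

# Outer names are taxonomic levels; inner names are comparisons.
test.list <- list(
  Phylum = list(groupA_vs_groupB = phylum_res)
)

# Levels covered, and the comparisons available at each level.
names(test.list)
lapply(test.list, names)
```

Inspecting `names()` at both depths like this is a quick way to check that the levels in `test.list` line up with `feature.level` before plotting.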

Value

A list of `ggplot` objects. Each element in the list corresponds to a comparison group (derived from `test.list` or `group.var`) and contains the circular cladogram with its heatmap. If there's only one comparison, the list will contain a single `ggplot` object.

Details

**Taxonomic Name Processing:** Internally, the function standardizes taxonomic names for plotting and matching with the phylogenetic tree. This involves:

Replacing spaces and parentheses `()` with underscores `_`.
Removing periods `.`.

This means that taxon names displayed on the plot might differ slightly from their original representation in `data.obj$feature.ann`. This processing is crucial for matching taxa from the `test.list` with the tips of the phylogenetic tree, especially if the tree is built from taxonomic information. "Unclassified" labels are also processed to ensure uniqueness by appending the taxonomic level and a unique identifier (e.g., "Unclassified_Genus_1").
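
The name processing described above can be approximated in base R. The helper names `sanitize_taxon_names` and `label_unclassified` are hypothetical, and the use of a running index as the unique identifier is an assumption; the package's internal code may differ.

```r
# Hypothetical helper; the package's internal implementation may differ.
sanitize_taxon_names <- function(x) {
  x <- gsub("[ ()]", "_", x)           # spaces and parentheses -> underscores
  x <- gsub(".", "", x, fixed = TRUE)  # drop literal periods
  x
}

# Make "Unclassified" labels unique by appending the level and a running index.
# (The running index is an assumed form of the "unique identifier".)
label_unclassified <- function(x, level) {
  idx <- which(x == "Unclassified")
  x[idx] <- paste0("Unclassified_", level, "_", seq_along(idx))
  x
}

sanitize_taxon_names(c("Clostridium sp. (cluster XIVa)", "Escherichia coli"))
label_unclassified(c("Prevotella", "Unclassified", "Unclassified"), "Genus")
```

Applying the same sanitizer to the tip labels of a user-supplied tree before comparing them with the processed taxon names should make mismatches easier to diagnose.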

**Phylogenetic Tree:** The function attempts to use a phylogenetic tree in the following order of preference:

If `data.obj$tree` exists and is a valid `phylo` object with tip labels matching the taxa at the `finest_taxonomic_level` (after name processing), this tree is used and pruned to matching tips.
If no suitable tree is found in `data.obj$tree`, a taxonomic tree is constructed using `ape::as.phylo` based on the hierarchy specified in `feature.level` from `data.obj$feature.ann`. The tips of this tree will correspond to the taxa at the `finest_taxonomic_level`.
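
As a rough sketch of the fallback path, a taxonomic tree can be built from an annotation table with the formula method of `ape::as.phylo`. The small `feature.ann` table below is made up for illustration, and the exact call used inside the function is not shown in this documentation, so treat this as one plausible approach.

```r
# Hypothetical annotation table standing in for data.obj$feature.ann.
feature.ann <- data.frame(
  Phylum = c("Firmicutes", "Firmicutes", "Bacteroidetes"),
  Family = c("Lachnospiraceae", "Ruminococcaceae", "Prevotellaceae"),
  Genus  = c("Blautia", "Faecalibacterium", "Prevotella"),
  stringsAsFactors = TRUE  # factors, which older ape versions expect
)
feature.level <- c("Phylum", "Family", "Genus")

# Build the nested formula ~Phylum/Family/Genus from feature.level.
tax_formula <- stats::as.formula(
  paste("~", paste(feature.level, collapse = "/"))
)

if (requireNamespace("ape", quietly = TRUE)) {
  # Tips of the resulting tree correspond to the finest level (Genus here).
  tax_tree <- ape::as.phylo(tax_formula, data = feature.ann)
  print(tax_tree$tip.label)

  # Pruning to a subset of tips, similar to reusing a supplied tree.
  pruned <- ape::keep.tip(tax_tree, c("Blautia", "Prevotella"))
  print(pruned$tip.label)
}
```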

**Significance Filtering:** Taxa are filtered based on the `cutoff` value. If `feature.mt.method` is "none", the raw `P.Value` is used. Otherwise, the `Adjusted.P.Value` is used. Coefficients of taxa not meeting the significance cutoff are set to 0 for the heatmap display. This filtering can propagate up the taxonomic tree: if all children of a higher-level taxon are non-significant, that higher-level taxon might also be rendered as non-significant.
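
The column selection and zeroing step can be sketched in base R. The helper `zero_nonsignificant`, the `Variable` and `Coefficient` column names, and the data are hypothetical. The adjusted P-values are recomputed here with `stats::p.adjust` only to keep the example self-contained; real test results may already carry an `Adjusted.P.Value` column.

```r
# Hypothetical test results; `Variable` and `Coefficient` are assumed names.
res <- data.frame(
  Variable = c("Blautia", "Prevotella", "Roseburia"),
  Coefficient = c(1.2, -0.7, 0.4),
  P.Value = c(0.001, 0.04, 0.60)
)

zero_nonsignificant <- function(res, feature.mt.method = "none", cutoff = 1) {
  # stats::p.adjust accepts the method names listed for feature.mt.method.
  res$Adjusted.P.Value <- stats::p.adjust(res$P.Value, method = feature.mt.method)
  # Raw P-values when no correction is requested, adjusted ones otherwise.
  p_col <- if (feature.mt.method == "none") "P.Value" else "Adjusted.P.Value"
  # which() skips NA P-values instead of failing on them.
  res$Coefficient[which(res[[p_col]] > cutoff)] <- 0
  res
}

zero_nonsignificant(res, feature.mt.method = "fdr", cutoff = 0.05)
zero_nonsignificant(res, feature.mt.method = "none", cutoff = 0.05)
```

With the default `cutoff = 1`, no P-value can exceed the threshold, so every coefficient is kept.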

**`ggtreeExtra::geom_fruit` and `geom_tile`:** This function uses `ggtreeExtra::geom_fruit` to draw the heatmap. Due to non-standard evaluation (NSE) within `ggtreeExtra`, the `geom` parameter is explicitly passed as the string `"geom_tile"`, and `ggplot2::geom_tile` is temporarily assigned to the global environment if not present, to ensure correct rendering. This is a workaround for a known behavior in `ggtreeExtra`.

Examples

if (FALSE) { # \dontrun{
data(subset_T2D.obj)

test.list <- generate_taxa_test_single(
    data.obj = subset_T2D.obj,
    time.var = "visit_number",
    t.level = NULL,
    group.var = "subject_race",
    adj.vars = "subject_gender",
    feature.level = c("Phylum", "Class", "Order", "Family", "Genus", "Species"),
    feature.dat.type = "count",
    prev.filter = 0.1,
    abund.filter = 0.0001
)

plot.list <- generate_taxa_cladogram_single(
  data.obj = subset_T2D.obj,
  test.list = test.list,
  group.var = "subject_race",
  feature.level = c("Phylum", "Class", "Order", "Family", "Genus", "Species"),
  feature.mt.method = "none",
  cutoff = 0.9,
  color.group.level = "Order"
)

test.list <- generate_taxa_test_single(
    data.obj = subset_T2D.obj,
    time.var = "visit_number",
    t.level = NULL,
    group.var = "subject_race",
    adj.vars = "subject_gender",
    feature.level = c("Order"),
    feature.dat.type = "count",
    prev.filter = 0.1,
    abund.filter = 0.0001
)

plot.list <- generate_taxa_cladogram_single(
  data.obj = subset_T2D.obj,
  test.list = test.list,
  group.var = "subject_race",
  feature.level = c("Order"),
  feature.mt.method = "none",
  cutoff = 0.9,
  color.group.level = "Order"
)

data(peerj32.obj)

test.list <- generate_taxa_test_single(
    data.obj = peerj32.obj,
    time.var = "time",
    t.level = NULL,
    group.var = "group",
    adj.vars = "sex",
    feature.level = c("Phylum","Family","Genus"),
    feature.dat.type = "count",
    prev.filter = 0.1,
    abund.filter = 0.0001
)

plot.list <- generate_taxa_cladogram_single(
  data.obj = peerj32.obj,
  test.list = test.list,
  group.var = "group",
  feature.level = c("Phylum", "Family", "Genus"),
  cutoff = 0.3,
  color.group.level = "Family"
)
} # }