Building Custom Prevalence Models

MaAsLin3 in Cosmos‑Hub models both abundance and prevalence associations from the same microbiome feature table. This guide explains how MaAsLin3 defines prevalence internally, how to export the correct data from Cosmos‑Hub, and how to use that data to build custom logistic regression models for secondary prevalence analyses.
This guide is intended for researchers who have already run MaAsLin3 in Cosmos‑Hub and want to extend their prevalence analysis using external tools such as R or Python. If you have not yet run MaAsLin3, start with the MaAsLin3 overview.

How MaAsLin3 Models Prevalence

MaAsLin3 is a generalized multivariable modeling framework designed to identify microbial associations in complex, high-dimensional datasets. Unlike approaches that model only abundance, MaAsLin3 captures two complementary biological signals from the same input feature table:
  • Prevalence: how often a feature is detected across samples, modeled via logistic regression on a binary presence/absence profile.
  • Abundance: how much of a feature is present when detected, modeled via log‑linear regression on non‑zero abundances.
As described in the MaAsLin3 paper (Nature Methods, 2025):
“MaAsLin 3 takes as input a table of microbial community feature abundances and metadata. These feature data are normalized, filtered, split into prevalence (present versus absent) and log‑transformed nonzero abundances, and fit with a modified logistic model and a linear model, respectively.”
When MaAsLin3 runs in Cosmos‑Hub, the workflow proceeds as follows:
  1. The input feature abundance table (taxa, pathways, or functional features) is normalized — by default using total‑sum scaling to relative abundances.
  2. Optional filtering is applied to remove extremely rare or low-variance features based on minimum prevalence, minimum abundance, and minimum variance thresholds.
  3. The filtered table is split into:
    • A binary prevalence profile (present = 1, absent = 0) for logistic regression.
    • A non‑zero abundance subset (log‑transformed) for linear regression.
This means that the same abundance values in the Input Data export are the source of both the prevalence and abundance models.
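The normalize-then-split idea can be sketched in a few lines of plain Python. This is an illustration only, not MaAsLin3's implementation: the function names and sample values are hypothetical, the filtering step is omitted, and the base‑2 logarithm is an assumption about the log transform.

```python
import math

def total_sum_scale(counts):
    """Scale one sample's feature counts so they sum to 1 (relative abundances)."""
    total = sum(counts.values())
    if total == 0:
        return {feature: 0.0 for feature in counts}
    return {feature: value / total for feature, value in counts.items()}

def split_prevalence_abundance(rel_abund):
    """Split one sample's relative abundances into the two modeling inputs."""
    # Prevalence profile: 1 if the feature is detected at all, else 0.
    prevalence = {f: int(v > 0) for f, v in rel_abund.items()}
    # Abundance profile: only non-zero values, log-transformed (base 2 assumed here).
    log_abundance = {f: math.log2(v) for f, v in rel_abund.items() if v > 0}
    return prevalence, log_abundance

sample = {"taxon_a": 30, "taxon_b": 0, "taxon_c": 10}  # hypothetical counts
rel = total_sum_scale(sample)
prev, log_ab = split_prevalence_abundance(rel)
```

Here `taxon_b` contributes an "absent" observation to the prevalence profile and contributes nothing to the abundance profile, which is why both models can be derived from a single input table.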

Defining Presence for Custom Logistic Regression

A common question when extending MaAsLin3 results is: what threshold should I use to define “presence”? The answer is straightforward. MaAsLin3 defines presence as any non‑zero value in the normalized feature table. There is no alternative recommended cutoff (such as >1e‑4 or >1% relative abundance) in the MaAsLin3 documentation or paper. Sparsity is managed through feature filtering parameters applied before the model, not by changing the definition of presence itself.
| Value in Input Data | Prevalence code | Interpretation |
|---|---|---|
| > 0 | 1 | Present |
| = 0 | 0 | Absent |
Using this coding ensures your external logistic regression models are conceptually consistent with the prevalence associations reported by MaAsLin3 in Cosmos‑Hub.
If your study has a validated limit of detection (LOD), for example from spike‑in calibration or qPCR, you may choose to define presence as "above LOD" rather than strictly "> 0". This is a valid study‑specific decision but is not required by MaAsLin3 itself.
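As a minimal sketch of the two coding rules, the hypothetical helper below defaults to the non‑zero definition and accepts an optional `lod` argument for studies with a validated detection limit. The function name, the `lod` parameter, and the example values are illustrative assumptions, not part of MaAsLin3 or Cosmos‑Hub.

```python
def code_presence(value, lod=0.0):
    """Return 1 if value is above the detection threshold, else 0.

    lod=0.0 reproduces the non-zero rule; a positive lod is a study-specific choice.
    """
    return 1 if value > lod else 0

abundances = [0.0, 0.00002, 0.013]  # hypothetical relative abundances

# Non-zero rule, consistent with the presence definition described above.
default_codes = [code_presence(v) for v in abundances]

# Study-specific alternative: treat anything at or below 1e-4 as undetected.
lod_codes = [code_presence(v, lod=1e-4) for v in abundances]
```

The second sample flips from present to absent under the LOD rule, which illustrates how a custom threshold can change the prevalence profile relative to the one MaAsLin3 modeled.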

Exporting MaAsLin3 Input Data from Cosmos‑Hub

To build a presence/absence matrix that mirrors MaAsLin3, start from the exact feature table used in the Hub run.
1. Open your MaAsLin3 comparative analysis

Navigate to your MaAsLin3 run from the Comparative Analysis dashboard in Cosmos‑Hub.

2. Click Export

Click the Export button in the top‑right of the analysis view. Cosmos‑Hub will generate and download a ZIP archive containing all output files for your MaAsLin3 run.

3. Locate the three key files

Inside the ZIP, you will find:
  • Input Data — A .tsv abundance matrix used as MaAsLin3’s feature input (samples as rows, features as columns). This is the primary file for building presence/absence variables.
  • Input Metadata — A .tsv file with all metadata variables used as covariates and outcomes in the model.
  • Association Results — The full MaAsLin3 output table with model type (abundance vs prevalence), beta coefficients, p‑values, and q‑values (FDR).
The official documentation for MaAsLin3 results and exports is available here:
https://docs.cosmosid.com/docs/maaslin3-view-results
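If you prefer to inspect the archive programmatically rather than unzipping it by hand, a sketch along these lines may help. The archive path and member names are hypothetical, since the exact file names inside the export are not specified here; list the members first and pick the Input Data and Input Metadata files from the result.

```python
import csv
import io
import zipfile

def list_tsv_members(zip_path):
    """Return the names of all .tsv files inside the export archive, sorted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(n for n in archive.namelist() if n.lower().endswith(".tsv"))

def read_tsv_member(zip_path, member):
    """Read one tab-separated member of the archive into a list of rows."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text, delimiter="\t"))

# Example usage (file names are placeholders, not the real export names):
# members = list_tsv_members("maaslin3_export.zip")
# rows = read_tsv_member("maaslin3_export.zip", members[0])
```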

Building a Presence/Absence Matrix

Once you have exported the Input Data file, follow these steps to create a binary presence/absence matrix for logistic regression.
1. Load the Input Data table

Import the Input Data .tsv file into R, Python, or your preferred statistical environment. The table has samples as rows and microbial features (taxa, pathways, etc.) as columns.

2. Recode each feature column to binary

For each feature (column), apply the following rule:
  • Assign 1 if the value is > 0 (present).
  • Assign 0 if the value is = 0 (absent).

3. Apply a minimum prevalence filter

Before fitting logistic models, exclude features that are extremely rare. See the section below on choosing a minimum prevalence threshold.

4. Run logistic regression

Use the binary presence/absence matrix as the response variable and the Input Metadata variables as predictors.
Because you are using the same feature table that MaAsLin3 used, your custom prevalence models will be directly comparable to the “Prevalence” associations shown in the MaAsLin3 Association Results tab and export files.
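The load, recode, and fit steps can be sketched end to end in plain Python. Several things here are assumptions made for illustration: the first column of the table is taken to be the sample ID, the inline table and the `group` covariate are invented toy data, and `fit_logistic` is an ordinary maximum‑likelihood logistic regression with a single predictor, not the modified logistic model that MaAsLin3 fits. For real analyses you would normally use an established routine (for example `glm` in R or a statistics package in Python) that also reports standard errors and p‑values.

```python
import csv
import io
import math

def load_presence_matrix(handle):
    """Read a samples-by-features TSV and recode every feature value to 0/1.

    Assumes the first column holds the sample ID (an assumption about the layout).
    """
    reader = csv.reader(handle, delimiter="\t")
    features = next(reader)[1:]
    presence = {}
    for row in reader:
        presence[row[0]] = {f: int(float(v) > 0) for f, v in zip(features, row[1:])}
    return presence

def fit_logistic(x, y, iterations=25):
    """Fit y ~ intercept + slope * x by Newton-Raphson; returns (intercept, slope)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0  # gradient and information-matrix entries
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            break  # singular information matrix, e.g. a constant predictor
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Toy stand-in for the Input Data export (hypothetical values).
input_data = io.StringIO(
    "sample\ttaxon_a\ttaxon_b\n"
    "S1\t0.10\t0.90\nS2\t0\t1\nS3\t0\t1\nS4\t0\t1\n"
    "S5\t0.25\t0.75\nS6\t0.05\t0.95\nS7\t0.40\t0.60\nS8\t0\t1\n"
)
# Toy stand-in for one binary covariate from Input Metadata (hypothetical).
group = {"S1": 0, "S2": 0, "S3": 0, "S4": 0, "S5": 1, "S6": 1, "S7": 1, "S8": 1}

presence = load_presence_matrix(input_data)
samples = sorted(presence)
y = [presence[s]["taxon_a"] for s in samples]
x = [group[s] for s in samples]
intercept, slope = fit_logistic(x, y)
```

The slope is the log odds ratio of detecting `taxon_a` in group 1 versus group 0; in this toy table the feature is present in 1 of 4 samples in group 0 and 3 of 4 in group 1, so the fitted slope is close to log(9).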

Choosing a Minimum Prevalence Threshold

Running logistic regression on extremely rare features — those present in only a few samples — leads to unstable model estimates, separation issues, and uninterpretable results. It is therefore recommended to apply a minimum prevalence filter to your feature set before modeling.

How to calculate prevalence

For each feature, compute its prevalence as: Prevalence (%) = 100 × [Number of samples where feature is present (> 0)] / [Total number of samples]
| Study size | Suggested minimum prevalence |
|---|---|
| Large cohort (n > 100) | 5% of samples |
| Medium cohort (n = 50–100) | 10% of samples |
| Small cohort (n < 50) | 10–20% of samples |
These thresholds are informed by common MaAsLin‑style workflow guidance, which recommends removing low-prevalence features prior to association testing to improve model stability and interpretability. Reference: Running MaAsLin2 Workflow – Cosmos‑Hub (prevalence filtering guidance is equally applicable to MaAsLin3 downstream analyses).
Setting your minimum prevalence threshold too low (e.g., 1% of samples in a small cohort) may introduce unstable logistic models with inflated or non-convergent estimates. Always check that your included features have sufficient “events” (presence observations) to support the number of predictors in your model.
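A minimal sketch of the prevalence calculation, the filter, and an events check follows. The function names and the toy presence vectors are hypothetical. Note one assumption: `events_per_predictor` counts the rarer of presence and absence, which is slightly broader than counting presence observations alone, so that nearly ubiquitous features are flagged as well as very rare ones.

```python
def prevalence_percent(presence_column):
    """Percent of samples in which the feature is present (coded 1)."""
    return 100.0 * sum(presence_column) / len(presence_column)

def filter_by_prevalence(presence_by_feature, min_percent):
    """Keep features whose prevalence is at least min_percent."""
    return {
        feature: column
        for feature, column in presence_by_feature.items()
        if prevalence_percent(column) >= min_percent
    }

def events_per_predictor(presence_column, n_predictors):
    """Observations of the rarer outcome (present or absent) per model predictor."""
    present = sum(presence_column)
    rarer = min(present, len(presence_column) - present)
    return rarer / n_predictors

# Toy presence/absence vectors for a 10-sample cohort (hypothetical).
presence_by_feature = {
    "taxon_a": [1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "taxon_rare": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

# Small cohort, so a 20% minimum prevalence is used for this example.
kept = filter_by_prevalence(presence_by_feature, min_percent=20)
support = events_per_predictor(kept["taxon_a"], n_predictors=2)
```

In this example `taxon_rare` (10% prevalence) is dropped, and `taxon_a` has only two rarer-outcome observations per predictor, a signal that a model with two predictors may be poorly supported in so small a cohort.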

Example Methods Text

Use the following template for your manuscript or internal SOP methods section, adapted to your study:
“For secondary prevalence analyses, the MaAsLin3 input abundance matrix was exported from Cosmos‑Hub (www.cosmos-hub.com, Cmbio, Germantown, MD). Each microbial feature was coded as present (1) when its input abundance was non‑zero and absent (0) otherwise, consistent with the binary presence/absence framework used by MaAsLin3 for logistic prevalence modeling [cite MaAsLin3 paper]. Features with prevalence below [X]% of samples were excluded prior to logistic regression to reduce model instability driven by extremely rare taxa.”
Replace [X]% with your chosen minimum prevalence threshold (e.g., 5%, 10%).

Frequently Asked Questions

Which export file should I use to build the presence/absence matrix?
Use the Input Data file. This contains the normalized abundance matrix that MaAsLin3 used as input, from which you can derive binary presence/absence values. The Association Results file contains model outputs (coefficients, p‑values), not the raw feature table.

Does MaAsLin3 recommend a presence threshold other than non-zero?
No. The MaAsLin3 paper and Bioconductor documentation do not recommend any threshold other than non-zero for defining presence. Sparsity is controlled via pre-model filtering parameters (minimum prevalence, minimum abundance), not by adjusting the presence definition.

Can I use a different presence threshold anyway?
You can, but only if justified by a study-specific rationale such as a known LOD. It is not consistent with how MaAsLin3 internally defines prevalence, and it may reduce comparability with your MaAsLin3 results.

What filtering does MaAsLin3 apply before modeling?
By default, MaAsLin3 applies minimum prevalence, minimum abundance, and minimum variance filters to remove uninformative features. The exact parameters used in your Cosmos-Hub run are reflected in the filtered Input Data export, so any features that were excluded by MaAsLin3 prior to modeling will not appear in the export.