Guides
Guides

Running MaAsLin3 Workflow

🚧

A note about running MaAsLin3:

Users are responsible for selecting parameters that best suit their study data and research objectives. While we provide basic recommendations, it is crucial to have a thorough understanding of the MaAsLin3 tool and the statistical methods employed to ensure accurate and reproducible results. We strongly encourage users to familiarize themselves with the tool’s capabilities and limitations to make informed decisions regarding their analysis settings.

Please note that (Beta) indicates the status of the MaAsLin3 pipeline, which is currently in beta stages upon the publish of their pre-print article.


Input Data Preparation

Before running MaAsLin3, you will need to process your .fastq files through one of the taxonomic or functional workflows available within the CosmosID-HUB. You will also need to add metadata to each of your samples to define cohorts needed for comparative analysis. Ensure your metadata encompasses all variables that you want to measure and that may affect your results, with clear groupings for main comparative variables.

Initiating MaAsLin3 Comparative

After profiling, select the samples that you would like to compare in the "Cohorts and Metadata" menu and click "Create CA" to initiate the configuration of your parameters.



Selecting the MaAsLin3 Comparative Workflow

Select one of the following workflows depending on your input data:

  • MaAsLin3 KEPLER-Taxa - conduct differential abundance analysis on Taxa-Kepler results for host agnostic profiling
  • MaAsLin3 CHAMP-Taxa - conduct differential abundance analysis on Taxa-CHAMP results for human-based profiling
  • MaAsLin3 HostAgnostic Functional - conduct differential abundance analysis on Functional 2.0 (MetaCyc, GO Terms, Pfam, CAZy, EnzymeCommission) profiling results
  • MaAsLin3 CHAMP-Functional - conduct differential abundance analysis on GBM/GMM/KEGG results
  • MaAsLin3 16S/ITS are also available for relevant datasets.

Recommended Workflow Parameters

This section outlines suggested settings for MaAsLin3 to optimize differential abundance analysis. Proper configuration ensures accurate results and helps address specific research questions effectively. Use these guidelines to select appropriate fixed and random effects, normalization techniques, and statistical thresholds tailored to your study design. Adjusting these parameters based on your dataset’s characteristics and research objectives will enhance the reliability and interpretability of your findings.

KEPLER-TaxaHostAgnostic FunctionalCHAMP-TaxaCHAMP-Functional
MetricRelative AbundanceCPMRelative AbundanceCellular Abundance
Taxonomic LevelSpeciesNASpeciesNA
Fixed EffectMetadata of InterestMetadata of InterestMetadata of InterestMetadata of Interest
Random EffectNumber of Reads + OptionalVariablesNumber of Reads + OptionalVariablesNumber of Reads + OptionalVariablesNumber of Reads + OptionalVariables
Longitudinal and Complex Association Formula (Optional)See pageSee pageSee pageSee page
Min. Abundance*0.0010.0010.0010.1
Min. Prevalence**0.050.250.050.25
Zero Threshold0000
Minimum Variance0000
Q Threshold0.010.010.010.01
NormalizationNONENONENONENONE
TransformationLOGLOGLOGLOG
Multiple CorrectionBHBHBHBH
Standardization
Augmentation
Median Comparison Abundance Compositional Correction
Median Comparison Abundance Threshold0000
Median Comparison Prevalence Compositional Correction0000
Subtract MedianFALSEFALSEFALSEFALSE
Maximum Number of Associations to Plot50505050
Number of Features in Summary Plot25252525

*A single hit can always be a sequencing error, contamination error, or other noise variables etc. Implementing a minimum abundance with minimum prevalence cutoff can ensure more accurate multi-variate association analysis data.

**Samples are far richer in functional hits than in taxonomic hits. Non-prevalent functions can sharply increase computational time and may include false positives, which is why more stringent cutoffs are recommended.


Building the statistical model (Fixed + Random Effects)

Building the MaAsLin3 model is dependent upon fixed and random effects that can be input as metadata variables (added using the "+ADD" button). Fixed effects are required, while random effects are optional (but recommended for when relevant metadata is available).

These variables define how samples are grouped into comparative cohorts, directly impacting the results. Fixed effects are the primary variables of interest, such as treatment groups or time points, while random effects account for potential confounding factors, like number of reads, age, or grouping variables. Properly defining these effects ensures that your analysis accurately reflects the biological questions you are exploring and minimizes biases in the model.

Fixed Effects

Fixed effects are metadata variables that relate to your hypothesis. These are the primary variables that you are interested in obtaining specific comparative results against. A fixed effect must have 2 or more groups for comparison (we recommend a max of 5 groups).

For example, a fixed effect could be:

  • "Condition" [Control vs. Treatment]
  • "DiseaseState" [Healthy vs. IBD vs. Crohn's]
  • "Time Point" [baseline vs. mid-study vs. endpoint]
  • "Drug Dosage" [25ng vs. 50ng vs. 100ng]
  • "Cancer Stage" [1 vs. 2 vs. 3 vs. 4]

Multiple fixed effects can be selected for comparisons across multiple metadata variables. Results will be presented in the analysis results for every variable/level.

📘

Selecting Effect Variable Parameters (Data Type + Reference Value)

When defining fixed and random effects, it’s essential to select appropriate data types and reference values.

Fixed effects should include categorical variables with clear groups or continuous variables. For categorical data, specify a reference value (if applicable) to serve as a baseline for comparison (e.g., day0, control, healthy). Should a categorical data type be selected without a reference value, MaAsLin3 by default sets the first category in alphabetical order as the reference.

Random effects account for variability due to confounding factors (e.g., subject ID in longitudinal studies, age, cage number). Correctly setting these parameters ensures robust statistical modeling and accurate interpretation of results.


Testing level-versus-level differences, rather than all groups against a reference value

Another feature of MaAsLin3 is the ability to test for level-versus-level differences using ordered predictors and contrast tests. Ordered predictors are categorical variables with a natural ordering such as cancer stage, consumption frequency of a dietary factor, or dosage group.

Say you want to test all fixed effects groups against one another, rather than to the reference value. For example, you want to test Drug Dosage across all levels [25ng vs. 50ng, 50ng vs. 100ng, 25ng vs. 100ng] instead of against baseline [25ng vs. 50ng, 25ng vs. 100ng – this is the default workflow when selecting the reference value].

To do so, you will have to incorperate "ordered(DrugDosage)" into the Longitudinal and Complex Association Formula, and you can read how to do so here. This will perform a contrast test for whether there is a difference between each pair of subsequent levels. The coefficient, standard error, and p-value all correspond to the difference between the level in value and the previous level. Ordered predictors should only be included as fixed effects (i.e., no ordered predictors as random effects etc.).


Random Effects

❗️

Implementing random effects with fewer than 5 observations per group may produce poor model fits and subsequently no significant associations.

👍

Adding Number of Reads as a Random Effect

Because MaAsLin 3 identifies prevalence (presence/absence) associations, sample read depth (number of reads) should be included. Deeper sequencing will likely increase feature detection in a way that could spuriously correlate with metadata of interest when read depth is not included in the model.

Random effects are metadata variables that may not be interested in evaluating but may present variability in the study data. Use these to account for confounding variables that might introduce noise, such as subject-specific differences (e.g., BMI, age). This helps in minimizing bias from uncontrolled factors. Only include relevant confounders to prevent model overfitting.

For example, a random effect could be:

  • BMI
  • Age
  • Cage number
  • Days of antibiotic administration
  • Experimental blocks
  • Longitudinal time points (requires a Group Effect Longitudinal and Complex Association Formula)

📘

Advanced modeling with Longitudinal and Complex Association Formula

You can also utilize the optional Longitudinal and Complex Association Formula to incorporate advanced metadata groupings, such as:

  • Group Effects (such as participant_id for longitudinal studies)
  • Ordered Effects (to cross-compare categorical variables with a natural ordering such as cancer stage, consumption frequency, or dosage group)
  • Strata Effects (a single grouping variable to be included in matched case-control studies)

To learn more, consult the Longitudinal and Complex Associations with MaAsLin3 documentation.


Selecting Other Model Parameters

Taxonomy Level

Options include: Phylum, Class, Order, Family, Genus, Species (recommended), Strain

Select the desired phylogenetic level of calls for differential abundance analysis.

Recommended: Species


Metric

Options for taxonomic data:

  • Relative abundance: percentage abundance per sample (recommended)
  • Normalized reads frequency: counts normalized to the size of their matching genome; not synonymous with the "Normalization" parameter
  • Reads frequency: raw read counts

Options for functional data:

  • CPM (KEPLER): copies per million (normalized counts for functional data) (recommended for KEPLER))
  • Relative abundance (KEPLER): percentage abundance per sample
  • Cellular abundance (CHAMP): represents the relative percentage of species that are able to perform a given function in the microbiome community (recommended for CHAMP)

Workflow

Indicates the version of analysis pipeline for generated data. If only 1 version of the pipeline has been run, there will only be one option selected by default. We recommend running analysis with the most up-to-date version of your pipeline.


See dedicated documentation page linked above.


Minimum Abundance

Features with abundances of at least in of the samples will be included for analysis. The threshold is applied before normalization and transformation.

Set thresholds to filter low-abundance features, reducing noise and focusing on biologically relevant data.

Default: 0.001


Minimum Prevalence Percentage

The minimum percent of samples for which a feature is detected at minimum abundance. Also see "Minimum Abundance" description above.

This parameter helps filter out rare features that may not provide meaningful insights by defining the minimum proportion of samples in which a feature must be present to be included in the analysis. Selecting an appropriate threshold reduces noise and focuses on features that are consistently observed across the study cohort. Typically, a threshold of 5% for taxonomic analysis and 25% for functional analysis is recommended, but this can vary based on the study design and specific research questions.

📘

Prevalence and abundance filtering in MaAsLin2

Typically, it only makes sense to test for feature-metadata associations if a feature is non-zero "enough" of the time. "Enough" can vary between studies, but a 10-50% minimum prevalence threshold is not unusual (and up to 70-90% can be reasonable). Selecting a minimum prevalence filter of 5% will test only features with at least 5% non-zero values.

Similarly, it's often desirable to test only features that reach a minimum abundance threshold in at least this many samples. By default, MaAsLin3 will consider any non-zero value to be reliable, and if you've already done sufficient QC in your dataset, this is appropriate. However, if you'd like to filter more or be conservative, you can set a minimum abundance threshold like min_abundance = 0.001 to test only features reaching at least this (relative) abundance.


Zero Threshold

Default: 0

Abundances less than or equal to zero_threshold will be treated as zeros. This is primarily to be used when the abundance table has likely low-abundance false positives.


Minimum Variance

Default: 0

Features with abundance variances less than or equal to Minimum Variance will be dropped. This is primarily used for dropping features that are entirely zero.


Q Value Significance Threshold

The maximum q-value threshold the be considered significant.

The Q value threshold controls the false discovery rate, helping to reduce false positives in multiple comparisons. A typical threshold is ≤ 0.1, but this can vary depending on the dataset (0.05-0.25). Adjust based on your study’s balance between identifying true associations and minimizing errors.


Normalization

Different normalization techniques adjust for varying sequencing depths or compositional biases:

  • NONE: (default)
  • TSS (Total Sum Scaling): (recommended) normalizes the feature table by dividing each feature by the total sum of features per sample.
  • CLR (Centered Log Ratio): Useful for compositional data, particularly when the data is sparse.
  • CSS (Cumulative Sum Scaling): Suited for sparse count data, normalizing features based on cumulative sums.
  • TMM (Trimmed Mean of M-values): assumes that most taxa are not differentially expressed between samples; computationally intensive for large data sets

Transformation

The transformation to apply to the features after normalization and before analysis. The option LOG is recommended, but PLOG (pseudo-log with a pseudo-count of half the dataset minimum non-zero abundance replacing zeros, particularly for metabolomics data) and NONE can also be used.

Choose transformations depending on the data distribution:

  • LOG, base 2: (default) Apply for skewed data, typically in microbial abundance
  • Logit: For binary data (e.g., presence/absence)
  • AST (Arcsine square root transformation): take the arcsine of the square root of a proportion (a number between 0 and 1), essentially "stretching out" the ends of a distribution to stabilize variance, particularly useful when analyzing data that represents proportions or percentages across different groups
    • Recommended for metabolomics data
  • None: Use with caution

Multiple Hypothesis Correction Method

Correction for multiple testing is a statistical method used to reduce the likelihood of false positive results when performing multiple comparisons or tests. This produces a q value (or False Discovery Rate, FDR), which complements the corresponding p value. When many statistical tests are conducted simultaneously, the chance of finding a significant result purely by chance increases. To address this, correction methods adjust the significance levels to control the overall error rate.

Options include:

  • Benjamini-Hochberg Procedure (BH): (default) This method controls the False Discovery Rate (FDR), which is the expected proportion of false positives among the declared significant results. The p-values are ranked, and a threshold is determined based on the desired FDR.
    • Strengths/Weaknesses: Less conservative than Bonferroni, so it is more likely to detect true positives, making it useful in large datasets. While it reduces the chance of false positives, some false discoveries may still occur.
  • Bonferroni Correction: This is the simplest and most conservative method ideal for small datasets. The p-value threshold is adjusted by dividing the original significance level (α) by the number of comparisons (m). For example, if you are testing 100 hypotheses with a significance level of 0.05, the adjusted p-value threshold would be 0.05/100 = 0.0005.
    • Strengths/Weaknesses: Controls the family-wise error rate (FWER), meaning it reduces the risk of any false positives. However, it can be overly conservative, especially when there are many comparisons, leading to false negatives (missing true associations).
  • Holm's Step-down Procedure: A stepwise version of the Bonferroni method. It adjusts the p-value thresholds sequentially, beginning with the smallest p-value. Each p-value is compared to a threshold based on its rank (smallest p-value gets divided by the number of tests, second smallest by one less, and so on).
    • Strengths/Weaknesses: More powerful than the Bonferroni correction, especially in situations with a smaller number of comparisons. However, it is still fairly conservative, which can reduce power.
  • Hochberg's Step-up Procedure: an adjustment method that is less conservative than the Bonferroni correction but more conservative than the Benjamini-Hochberg procedure. It involves sorting the p-values in ascending order. Starting from the smallest, each p-value is compared to a threshold calculated as α/(m+1−i), where α is the desired overall alpha level, m is the total number of tests, and i is the rank of the current p-value.
    The process stops when a p-value exceeds its corresponding threshold, and all smaller p-values are considered significant.
    • Strengths/Weaknesses: It is more powerful than the Bonferroni correction, especially when dealing with a smaller number of hypotheses. However, it can still be conservative, particularly with a large number of hypotheses, which may reduce its power. Less effective in situations where there is a high dependency among tests.
  • Hommel's Method : controls the family-wise error rate (FWER). It is a generalization of the Bonferroni and Holm's procedures and provides a balance between stringency and power. Similar to Holm’s step-down method, it adjusts p-values but uses a more complex set of calculations to determine the critical values for each step.
    It modifies the thresholds for significance in a way that is adaptive based on the number of tests still under consideration.
    • Strengths/Weaknesses: Generally more powerful than both Holm's and Bonferroni methods, as it adjusts the critical values more finely. Controls the FWER effectively, making it suitable for scenarios where controlling the type I error is crucial. However, like other methods controlling FWER, it may be too conservative when dealing with a large number of hypotheses, potentially leading to an increase in type II errors.
  • Benjamini-Yekutieli Procedure (BY): an extension of the more commonly known Benjamini-Hochberg (BH) false discovery rate (FDR) controlling procedure. It adjusts for the testing of multiple hypotheses while considering the dependency between test statistics, which is often not addressed by other methods including BH. The procedure controls the expected proportion of incorrectly rejected hypotheses (false discoveries) under any dependency structure among the test statistics.
    • Strengths/Weaknesses: Robust against any type of dependency structure among the hypotheses, which is particularly relevant in complex biological data like microbiome studies where variables (e.g., presence of specific taxa) may be interdependent. Provides a more conservative approach than BH, reducing the likelihood of false positives in the presence of dependency among tests. However, it is more conservative than Benjamini-Hochberg, especially in cases of positive dependency, which can reduce statistical power to detect true effects when the dependencies among tests are strong. The increased conservativeness means that BY might not reject as many hypotheses as BH, potentially missing some true associations.

Standardize Numeric/Continuous Metadata Variable

Apply z-score standardization so continuous metadata are on the same scale.

Standardizing a numeric or continuous metadata variable involves transforming the values to have a mean of zero and a standard deviation of one. This process, also known as z-score normalization, allows variables with different units or scales to be compared directly. Standardization is crucial when variables differ significantly in their ranges or units, ensuring that each variable contributes equally to the model and preventing any one variable from disproportionately influencing the analysis. It is commonly used in linear models and clustering algorithms.

📘

Why standardize your numeric variables?

Suppose you have a microbiome dataset where you want to analyze the impact of participants' age and BMI on microbial abundance. Since age and BMI have different scales, standardizing these variables (e.g., converting age from years and BMI from kg/m² to z-scores) ensures that both contribute equally to the analysis. This prevents the model from being disproportionately influenced by the variable with the larger numerical range, allowing for more balanced and interpretable results.


Augment

To avoid linear separability in the logistic regression, at each input data point, add an extra 0 and an extra 1 observation weighted as the number of predictors divided by two times the number of data points. This is almost always recommended to avoid linear separability while having a minor effect on fit coefficients otherwise.


Median Comparison Abundance/Prevalence Compositional Corrections

When "Median Comparison Abundance Compositional Correction" or "Median Comparison Prevalence Compositional Correction" is checked on, the coefficients for a metadatum will be tested against the median coefficient for that metadatum (median across the features). Otherwise, the coefficients will be tested against 0.

For abundance associations, this is designed to account for compositionality: the idea that if only one feature has a positive association with a metadatum on the absolute scale (cell count), the other features will have apparent negative associations with that metadatum on the relative scale (proportion of the community) because relative abundances must sum to 1.

More generally, associations on the relative scale are not necessarily the same as the associations on the absolute scale in magnitude or sign, so testing against zero on the relative scale is not equivalent to testing against zero on the absolute scale. When testing associations on the relative scale, the coefficients should be tested against 0 (median comparison off). However, since these tests do not correspond to tests for associations on the absolute scale, we instead use a test against the median, which can enable some inference on the absolute scale.

There are two interpretations of this test for absolute abundance associations:

  • In linear models without sparsity (or with sparsity under some assumptions), if two features' associations with a particular metadatum on the log absolute scale differ by some value d, the features' associations with that metadatum on the log relative scale (total-sum scaling) will also differ by d. That is, the absolute and relative coefficients for a particular feature-metadatum association are different, but the ordering of the relative coefficients is the same as the ordering of the absolute coefficients for a metadatum, and the difference between two coefficients on the relative scale is the same as the difference between the corresponding coefficients on the absolute scale. Therefore, the test against the relative coefficient median can always be interpreted as "a test of whether a particular association is significantly different from the typical (median) association on the absolute scale."
  • Under the assumption that at least half the features are not changing on the absolute scale, the median true absolute coefficient is 0, so this can be interpreted as a test of whether the feature has a non-zero association on the absolute scale.

By contrast, sparsity should be less affected by compositionality since a feature should still be present even if another increases or decreases in abundance. (Note that, because the read depth is finite, this might not always be true in practice.) Therefore, Median ComparisonPrevalence Compositional Corrections is off by default, but it can be turned on if the user is interested in testing whether a particular prevalence association is significantly different from the typical prevalence association.

In both cases, if the tested coefficient is within Median Comparison Abundance/Prevalence Threshold of the median, it will automatically receive a p-value of 1. This is based on the idea that the association might be statistically significantly different but not substantially different from the median and therefore is likely still a result of compositionality.

To conclude:

  • Median Comparison Abundance Compositional Corrections is activated by default. This is recommended for relative abundance data but should not be used for absolute abundance data. This will enable abundance coefficients to be tested against a null value corresponding to the median coefficient for a metadata variable across the features. Otherwise, test against 0. When this is TRUE, only the p-values and q-values change. The coefficients returned are still the relative abundance coefficients unless Subtract Median is set to TRUE, in which case the median will be subtracted.
    • Median Comparison Abundance Compositional Corrections should be FALSE when (1) testing for significant relative associations, (2) testing for absolute abundance associations under the assumption that the total absolute abundance is not changing, or (3) testing for significant absolute associations when supplying spike-in or total abundances with unscaled_abundance.
  • Median Comparison Prevalence Compositional Corrections is deactivated by default. This is only recommended to be activated if the user is interested in how feature prevalence associations compare to each other or if there is likely strong compositionality-induced sparsity. This tests prevalence coefficients against a null value corresponding to the median coefficient for a metadata variable across the features. Otherwise, test against 0.
  • Median Comparison Abundance Threshold (default 0): Coefficients within this threshold of the median association will automatically be counted as insignificant (p-value set to 1) since they likely represent compositionality-induced associations. This threshold will be divided by the metadata variable's standard deviation if the metadatum is continuous to ensure the threshold applies to the right scale.
  • Median Comparison Abundance Threshold (default 0): Same as the abundance threshold but applied to the prevalence associations.
  • Subtract Median (default FALSE): Subtract the median from the coefficients.

Maximum Number of Associations to Plot

Number of significant associations to plot in the volcano plots

Recommended: 50


Number of features in Summary Plot

Number of features to plot in the summary plot/heatmap

Recommended: 25