Multi-Strain Detection
The Advanced User Feature: Multi-strain Detection
Description
A specific set of filter options is made available upon request to advanced users of the CosmosID Metagenomics Cloud (app.cosmosid.com).
This page describes how this feature can be used, which use cases are valid and why, and which limitations need to be considered by the user.
Limited use cases
The following are examples of valid use cases for the Multi-strain Detection feature:
- Analysis of rationally designed consortia containing more than one strain of the same species
- Analysis of microbiome standards containing more than one strain of the same species
- Analysis of selective or enrichment cultures expected to contain more than one strain of the same species
How to use the Advanced User Feature: Multi-strain Detection
With Multi-Strain Detection activated, the options "Total-MultiStrain" and "Filtered-MultiStrain" become available under the "Filterset" button.

Background
What metagenomic classification tools do well
Metagenomic classification tools excel at detecting and quantifying microbes in mixed microbial datasets, generated for instance from microbiome or environmental samples.
The validated tools and databases by Cmbio deliver microbiome profiles with industry-leading performance in terms of taxonomic resolution (down to strain level), the highest sensitivity and precision, and the lowest false positive rates. You can find out more about how CosmosID performs compared to other classifiers here and here. This performance advantage is achieved by the uniquely curated databases of phylogenetically organized biomarkers (kmers), which include fewer unspecific biomarkers than conventional kmer databases. The unique database structure further allows the use of special filters to reliably identify and remove false positive calls.
When inspecting microbiome profiles, users can switch between Filtered and Total results using the Filterset menu in the app (https://docs.cosmosid.com/docs/filtering). The filter thresholds are chosen based on machine learning-derived patterns of how frequently unique kmers matching an organism’s accessory genome vs. shared kmers matching the core genome of its lineage are observed in a given dataset.
What metagenomic classification tools are not designed to do
Metagenomic classification tools are designed to identify microbes by comparing NGS reads to genomes or to biomarkers in a database. They are, however, designed to ignore rather than capture population genomic effects within a strain’s population.
At this point some biological background is important.
In nature, microbes that are very closely related to each other (e.g. bacterial strains of the same species) tend to occupy the same ecological niche when given the opportunity. In other words, because their genomes are extremely similar, they are forced to compete for the same set of resources. Yet even marginal differences in genome sequence between such strains can lead to differences in genome replication and cell division rate (what microbial ecologists call “reproductive fitness”). Given time, the strain with the higher reproductive fitness will sweep through the population, replacing its close relatives a little more with every cell division.
For this reason, it is very rare to find multiple strains of the same species, or strains with extremely similar genomes, co-occurring at significant abundance levels in environmental or microbiome-derived samples.
Another inconvenient truth from the perspective of the bioinformatician involved in metagenomic analysis is that the genomes we have in our databases may in fact only be an approximation of any strain’s actual genome. Given constant mutation rates, evolutionary principles apply. Small random changes in the genomes of theoretically clonal cell populations arise and disappear as they are selected for or against by the environmental conditions.
For this reason, it is only a matter of probability that a metagenomic classification tool finds, in a subpopulation of strain A, a kmer that matches the database genome of strain B better than that of strain A. As a result, even widely used kmer-based tools (like Kraken) tend to report a long tail of low-abundance false positive calls.
The unique phylogenetic organization of the CosmosID database addresses this problem in part by allowing more specific nearest-neighbor placements than other tools. Together with the filter settings (which check whether, for any identified unique kmer, other unique and shared kmers are found as well), CosmosID turns what is a problem for most tools into a non-issue.
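The effect can be illustrated with a toy example. The sketch below is not how CosmosID or Kraken work internally; the sequences, the kmer length and the helper names are invented purely for illustration. Two hypothetical reference genomes differ at two positions, and a subpopulation of strain A carries a mutation at one of them that happens to match strain B.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

K = 5  # toy kmer length; real tools use much longer kmers

# Hypothetical reference genomes that differ at positions 4 and 14.
strain_a_ref = "ACGTTGCAAGCTTAGGCATC"
strain_b_ref = "ACGTAGCAAGCTTACGCATC"

# Kmers found in only one of the two references.
unique_to_a = kmers(strain_a_ref, K) - kmers(strain_b_ref, K)
unique_to_b = kmers(strain_b_ref, K) - kmers(strain_a_ref, K)

# A cell of strain A with a single mutation at position 4 (T to A),
# which by chance matches the strain B reference at that position.
mutated_a = "ACGTAGCAAGCTTAGGCATC"

hits_a = kmers(mutated_a, K) & unique_to_a
hits_b = kmers(mutated_a, K) & unique_to_b

print(f"kmers pointing to strain A: {len(hits_a)}")
print(f"kmers pointing to strain B: {len(hits_b)}")
```

Although strain B is absent from this toy sample, the mutated strain A sequence still yields kmers that are unique to the strain B reference, which is the kind of signal that ends up in the long tail of false positive calls.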
What does this mean for multistrain detection using Cosmos-Hub?
Conventional kmer-based tools like Kraken may detect multiple strains per species, but these are reported indistinguishably mixed with a long list of false positive calls.
At CosmosID we enable multistrain detection by modifying the already mentioned filter settings that prevent false positives. Somewhat relaxed settings will detect multiple strains per species, but at a price:
Due to the aforementioned population genomic effects, sometimes the tool will falsely report the presence of two strains of the same species where only one is present.
Also, in cases where multiple strains of the same species are detected, the strain-level abundance estimates tend to be too high.
This is why we offer this option only to advanced users and only for very specific use cases.
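The trade-off described in this section can be sketched with a toy filter. Everything in the block below is an assumption made for illustration: the threshold values, the field names and the hit counts are invented, and the real CosmosID filters are derived differently and are not reproduced here. The sketch only shows why relaxing a threshold on unique kmers lets a second strain of the same species pass once its few unique hits are backed by shared kmers.

```python
def passes_filter(unique_hits, shared_hits, min_unique, min_shared):
    """Toy rule: accept a strain call only if its unique kmer hits are
    backed by enough shared (lineage-level) kmer hits."""
    return unique_hits >= min_unique and shared_hits >= min_shared

# Hypothetical hit counts for a sample that truly contains only strain A.
# Strain B gets a few unique hits from diverging kmers in the strain A
# population, and inherits the shared hits because it is the same species.
calls = {
    "strain A": {"unique": 120, "shared": 900},
    "strain B": {"unique": 6, "shared": 900},
}

# Invented threshold sets standing in for default and relaxed settings.
STRICT = {"min_unique": 20, "min_shared": 100}
RELAXED = {"min_unique": 5, "min_shared": 100}

def reported(calls, thresholds):
    """Return the sorted names of all strain calls that pass the filter."""
    return sorted(
        name
        for name, hits in calls.items()
        if passes_filter(hits["unique"], hits["shared"], **thresholds)
    )

print("strict: ", reported(calls, STRICT))
print("relaxed:", reported(calls, RELAXED))
```

With the strict thresholds only strain A is reported; with the relaxed thresholds strain B passes as well, even though it is not present in this toy sample.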
Caveats/Limitations
- It is possible that for a given species, two strains are reported when really only one is present or predominant.
This is due to the fact that population genomic effects within this strain lead to a small population of diverging kmers that better match database entries for an alternative strain. If both reported strains are in fact closely related, then the unique kmer calls are backed up by the presence of shared kmers and both calls end up passing the filter.
- Solution: This issue disappears when aggregating results to species level using the CosmosID taxonomy switcher.
- When two or more strains of the same species are reported, it is possible that abundance at strain level is overestimated. The summed strain-level abundance for these strains would then exceed the abundance that the tool shows at species level.
- Solution: This issue is also addressed by using the results at species level, where the abundance estimates are accurate again. As a consequence, we also limit the CosmosID Comparative Analysis feature for multistrain data to species-level resolution (as comparative analysis depends entirely on accurate abundance estimation).
- Microbiome-derived samples and samples from environmental sources should never be analyzed using the Multi-strain Detection feature. Remember, the analysis of these sample types is the very purpose that metagenomic classification tools are built for. And CosmosID is the best among them!
- Solution: As described above, in nature two extremely closely related taxa (such as strains) should not co-exist at detectable abundance levels. Therefore, please use the Multi-strain Detection feature only for samples (such as those listed under Limited use cases) in which multiple closely related strains of the same species co-exist at significant abundance levels, e.g. due to artificial or experimental conditions.
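The second caveat above can be checked on exported results. The following is a minimal sketch, assuming the strain-level and species-level abundances have already been read into plain Python structures; the species names, values and the function name are hypothetical and do not reflect any particular CosmosID export format. It sums the strain-level abundances per species and flags species where that sum exceeds the species-level value.

```python
from collections import defaultdict

# Hypothetical strain-level rows: (species, strain, relative abundance in %).
strain_rows = [
    ("Species X", "strain 1", 30.0),
    ("Species X", "strain 2", 25.0),
    ("Species Y", "strain 1", 20.0),
]

# Hypothetical species-level abundances (%) for the same sample.
species_abundance = {"Species X": 48.0, "Species Y": 20.0}

def overestimated_species(rows, species_level, tolerance=0.0):
    """Return {species: (summed strain abundance, species abundance)} for
    every species whose strain-level sum exceeds the species-level value
    by more than the given tolerance."""
    totals = defaultdict(float)
    for species, _strain, abundance in rows:
        totals[species] += abundance
    return {
        species: (total, species_level[species])
        for species, total in totals.items()
        if total > species_level[species] + tolerance
    }

flagged = overestimated_species(strain_rows, species_abundance)
print(flagged)
```

For any species flagged this way, the species-level abundance is the value to rely on, in line with the solutions given above.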