KEPLER Results
Sample Table Column Descriptions
Name | The name of the organism (at strain, species, or higher taxonomic level). |
Tax ID | Link to the NCBI taxonomic identifier (Tax ID) for the organism. |
GTDB ID | Link to the Genome Taxonomy Database (GTDB) identifier for the bacterial strain, where available. |
Relative Abundance | Proportion of a given taxon/feature within the total microbial community detected. Expressed in % and calculated as follows: Relative Abundance = Normalized Reads Frequency (taxon n) / Sum of Abundance Score (all taxa). This metric is most suitable for downstream comparative analysis. |
% Total Matches | A metric describing how well the total (shared + unique) biomarkers for a strain are detected in the sample. |
% Unique Matches | A metric describing how well the strain-specific biomarkers are detected in the sample. |
Reads Frequency | A probabilistic estimate of the number of reads aligning to the respective genome. |
Normalized Reads Frequency | Reads Frequency normalized by reference genome size. Used to calculate the Relative Abundance. This ensures differences in read counts are not due to larger/smaller genomes and is suitable for downstream comparative analysis or differential abundance analysis. |
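The relationship between Reads Frequency, Normalized Reads Frequency, and Relative Abundance described in the table can be illustrated with a minimal sketch. It assumes that the "Abundance Score" in the formula is the Normalized Reads Frequency and that normalization is a simple division by genome size in base pairs; KEPLER's exact scaling is not stated here. The function name, dictionary layout, and sample values are hypothetical.

```python
def relative_abundance(reads_frequency, genome_size_bp):
    """Return {taxon: relative abundance in %}.

    Assumption: normalization is reads / genome size (bp). The scaling
    KEPLER actually applies is not documented in this section.
    """
    # Normalize each taxon's read estimate by its reference genome size.
    normalized = {
        taxon: reads_frequency[taxon] / genome_size_bp[taxon]
        for taxon in reads_frequency
    }
    total = sum(normalized.values())
    if total == 0:
        return {taxon: 0.0 for taxon in normalized}
    # Express each taxon's share of the normalized total as a percentage.
    return {taxon: 100.0 * value / total for taxon, value in normalized.items()}


# Hypothetical sample: equal read estimates, but taxon_b has a genome half the size.
reads = {"taxon_a": 5000.0, "taxon_b": 5000.0}
sizes = {"taxon_a": 5_000_000, "taxon_b": 2_500_000}
abundance = relative_abundance(reads, sizes)
```

In this made-up sample both taxa have the same Reads Frequency, yet the taxon with the smaller genome receives the larger Relative Abundance, which is the genome-size effect the normalization is meant to correct for.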
Calls at different taxonomic levels
In your results you may see some calls at sub-species or strain level and others at species, genus, or higher levels. CosmosID calls the lowest level possible based on the identification of unique k-mers that match the reference genomes. If identification is not possible at strain level, for example, we try to make an identification at species level.
% Total vs. Unique Matches
In KEPLER's taxonomy analysis, % Total and % Unique Matches provide insights into how well a strain’s biomarkers are detected in a sample. However, these values do not directly measure confidence in taxonomic classification beyond the strain level. Instead, they describe the extent to which the sample matches reference biomarkers in the database.
For precise strain-level calls, a 1% Unique Matches cutoff is recommended. KEPLER's "Filtered" results already include high-confidence species assignments.
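As an illustration of applying the recommended cutoff to an exported results table, here is a minimal sketch. The row layout and the `pct_unique_matches` key are hypothetical, not KEPLER's export schema, and treating the cutoff as inclusive (>= 1%) is an assumption.

```python
UNIQUE_MATCH_CUTOFF = 1.0  # percent; the recommended strain-level cutoff


def filter_strain_calls(rows, cutoff=UNIQUE_MATCH_CUTOFF):
    """Keep rows whose % Unique Matches meets the cutoff.

    Assumption: each row is a dict with a hypothetical
    'pct_unique_matches' key, and the cutoff is inclusive.
    """
    return [row for row in rows if row["pct_unique_matches"] >= cutoff]


# Hypothetical strain-level rows.
calls = [
    {"name": "strain_a", "pct_unique_matches": 12.4},
    {"name": "strain_b", "pct_unique_matches": 0.3},
    {"name": "strain_c", "pct_unique_matches": 1.0},
]
kept = filter_strain_calls(calls)
kept_names = [row["name"] for row in kept]
```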
Definitions:
% Total Matches is the number of shared + unique k-mer matches identified in the sample divided by the total number of pre-calculated shared + unique k-mers available for that organism in the reference database. Shared k-mers are shared with similar organisms of the same lineage in the phylogenetic tree, which makes this a useful metric for approximating gene coverage or for assessing the likelihood that a detected taxon is actually present.
- A higher % Total Matches suggests that many biomarkers associated with the strain are present, but this includes shared biomarkers found in closely related strains.
- A low % Total Matches does not necessarily indicate misassignment; it may simply mean that the strain is at low abundance in the sample.
- A strain could have a high % Total Matches due to many shared markers, but without sufficient unique markers, it may not be distinguishable from other closely related strains.
% Unique Matches is the proportion of unique biomarkers for a strain that were detected in the sample, compared to the total unique biomarkers for that strain in the reference database.
Unique biomarkers are genetic signatures found only in that strain and absent in other strains of the same species.
- A higher % Unique Matches suggests strong strain-level specificity since more strain-specific markers are detected. However, this does not directly indicate high confidence at broader taxonomic levels (e.g., species or genus) because related strains may share large portions of their genomes.
- Low % Unique Matches could mean that the exact strain is not well-represented in the database or that the sample contains a closely related strain.
- For strains, this could indicate that the exact strain in our reference database is not a perfect match to the strain in the sample. However, if you also see a high % Total Matches for this same strain, this is a good indication that a near taxonomic neighbor of the reference strain is in your sample.
- For some organisms that have high representation in sequencing space (E. coli, for example), there may not be many unique areas in the genome available for strain-level identification. As more and more similar genomes are added to the database, we trade off the ability to discriminate between those that are highly similar to each other. This will cause the % Unique Matches to be low.
% Unique Matches is not recommended for downstream comparative analysis.
It is used by the Filtered setting to rule out likely false-positive calls.
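The two definitions above can be illustrated with a toy calculation on sets of k-mers. This is a sketch under simplifying assumptions, not KEPLER's implementation: the reference biomarkers are modeled as plain Python sets, every name is hypothetical, and the 4-mers are invented for readability.

```python
def percent_matches(sample_kmers, reference_kmers):
    """Percent of reference k-mers that are also found in the sample."""
    if not reference_kmers:
        return 0.0
    detected = sample_kmers & reference_kmers  # set intersection
    return 100.0 * len(detected) / len(reference_kmers)


# Hypothetical reference biomarkers for one strain.
unique_ref = {"AAAC", "AAAG", "AAAT", "AACC"}  # found only in this strain
shared_ref = {"CCCA", "CCCG", "CCCT", "CCGG", "CCGT", "CCTT"}  # shared with relatives

# Hypothetical k-mers observed in the sample.
sample = {"AAAC", "CCCA", "CCCG", "CCCT", "GGGG"}

pct_unique = percent_matches(sample, unique_ref)               # 1 of 4 unique k-mers
pct_total = percent_matches(sample, unique_ref | shared_ref)   # 4 of 10 total k-mers
```

Here most of the detected k-mers are shared ones, so % Total Matches (40%) is higher than % Unique Matches (25%), the pattern the bullets above associate with a close relative of the reference strain being present.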
A common use of % Total Matches is for comparative analysis of antibiotic resistance genes or virulence factors, as these databases are gene-based rather than organism-based. The number of total k-mers identified gives a meaningful indication of how well the gene is covered by the reads in the sample.
Use the % Total Matches metric to compare how coverage changes between samples, and use the Relative Abundance metric to compare how the composition of marker genes changes between samples.