Here are some definitions of terms we use in this guide.
A kmer is a nucleotide sequence of a certain length. In genomics it is common, for example, to extract every kmer of a fixed length from each read in a sample.
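As a minimal sketch of the definition above, the following enumerates every kmer of a fixed length in a single read. The function name `kmers` and the example read are illustrative assumptions, not part of the CosmosID software.

```python
from typing import List


def kmers(read: str, k: int) -> List[str]:
    """Return every length-k substring (kmer) of a read, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length n contains n - k + 1 kmers; shorter reads yield none.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("ACGTAC", 4))  # ['ACGT', 'CGTA', 'GTAC']
```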
Whole genome shotgun sequencing (WGS) - with this method of DNA sequencing, all microbial DNA in the sample is fragmented into small pieces for next-generation sequencing.
Using WGS as described above, the CosmosID algorithms identify microorganisms based on the entire genomes of the organisms in our database.
Unlike shotgun metagenomics, amplicon (16S/ITS) analysis uses only the relevant gene or genes for identification, not the entire genome.
The name of the organism or taxonomic level, or the name of the antibiotic resistance gene or virulence factor.
Calls at different taxonomic levels
In your results you may see some calls at sub-species or strain level and others at species or genus level or higher. CosmosID calls the lowest level possible based on the identification of unique kmers that match the reference genomes. If identification was not possible at strain level, for example, we try to make an identification at species level.
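The fallback described above (try strain level first, then move up) can be pictured with a small sketch. This is not the CosmosID algorithm: the rank list, the `min_hits` cutoff, and the function name are all hypothetical and only illustrate the idea of reporting the most specific level that has unique kmer support.

```python
from typing import Dict, Optional

# Ranks ordered from most to least specific. This list is illustrative only.
RANKS = ["strain", "species", "genus", "family"]


def lowest_possible_call(unique_hits: Dict[str, int], min_hits: int = 1) -> Optional[str]:
    """Return the most specific rank with enough unique kmer hits, else None.

    `min_hits` is a hypothetical cutoff for this sketch, not a CosmosID parameter.
    """
    for rank in RANKS:
        if unique_hits.get(rank, 0) >= min_hits:
            return rank
    return None


# No strain-specific kmers were found, so the call falls back to species.
print(lowest_possible_call({"strain": 0, "species": 12, "genus": 40}))  # species
```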
The tax ID links to the NCBI tax_id for the organism.
The Abundance Score is an absolute abundance metric. It is used to calculate the Relative Abundance (%). The abundance score is a normalized metric taking into consideration genome size and number of reads. This makes this metric suitable for downstream comparative analysis or differential abundance analysis.
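One way to picture the step from abundance scores to Relative Abundance (%) is shown below. It assumes relative abundance is simply each organism's share of the summed abundance scores in a sample; the exact CosmosID calculation is not given here, so treat this as an illustration with made-up numbers.

```python
from typing import Dict


def relative_abundance(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert per-organism abundance scores into percentages of the sample total.

    Assumption for this sketch: relative abundance = score / sum of scores * 100.
    """
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: 100.0 * score / total for name, score in scores.items()}


# Made-up abundance scores for two organisms in one sample.
print(relative_abundance({"Organism A": 30.0, "Organism B": 10.0}))
# {'Organism A': 75.0, 'Organism B': 25.0}
```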
Frequency is the number of unique kmer occurrences in the queried sample. This is roughly equivalent to the number of reads that matched the identified organism.
Frequency is useful when you want to see the raw number of kmer hits to unique regions in a genome rather than the normalized representation you see with relative abundance. For example, suppose your sample has a low relative abundance of Salmonella enterica, such as 0.1%, but you want to know exactly how many unique kmers match that genome. Frequency will tell you that. There are two other important things to understand, though. 1) Frequency counts only unique kmers. Matching kmers that are shared with other genomes are not reported here. 2) The kmers counted for frequency can be redundant; in other words, they could come from multiple reads that cover the exact same region of the genome. They could also come from repetitive regions in the genome (i.e., identical kmers could have originated from multiple locations in the genome).
You can also use frequency as a parameter for comparative analysis. If you do, your stacked bar graph will show absolute numbers (with the ability to switch to relative numbers). Your other comparative analyses will differ slightly because of this different input.
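The two counting rules for frequency (only kmers unique to the organism count, and repeated occurrences are counted each time) can be sketched as follows. The kmer sets, reads, and function name are invented for illustration and do not reflect the actual CosmosID implementation.

```python
from typing import Iterable, Set


def frequency(reads: Iterable[str], unique_kmers: Set[str], k: int) -> int:
    """Count every occurrence of an organism-unique kmer across all reads.

    Kmers absent from `unique_kmers` (for example, kmers shared with other
    genomes) are ignored; repeated occurrences of the same kmer all count.
    """
    count = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in unique_kmers:
                count += 1
    return count


unique_to_organism = {"ACGT", "GGCC"}   # hypothetical organism-unique kmers
reads = ["ACGTT", "ACGTA", "TTTTG"]     # two reads cover the same unique kmer
print(frequency(reads, unique_to_organism, k=4))  # 2
```

Here the kmer ACGT is seen in two different reads, so it contributes 2 to the frequency even though only one distinct unique kmer was observed.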
Unique matches percent is the number of unique kmers found in your sample for an organism, out of the total possible number of kmers in our database that are unique to that organism. If percent unique coverage is very low for an organism, it could mean a couple of things.
For strains, this could indicate that the exact strain in our reference database is not a perfect match to the strain in the sample. However, if you also see a high percent TOTAL coverage for this same strain, this is a good indication that a near taxonomic neighbor of the reference strain is in your sample.
For some organisms that have high representation in sequencing space (E. coli, for example), there may not be many unique areas in the genome available for strain level identification. As more and more similar genomes are added to the database, we lose the ability to discriminate between them as they are so highly similar to each other. This will cause the percent unique coverage to be low.
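The ratio behind unique matches percent can be written out as a short sketch. It assumes the kmers are available as plain Python sets, which is a simplification for illustration; the set contents and function name are hypothetical.

```python
from typing import Set


def unique_matches_percent(sample_kmers: Set[str], reference_unique_kmers: Set[str]) -> float:
    """Percent of an organism's unique reference kmers that appear in the sample."""
    if not reference_unique_kmers:
        return 0.0
    found = sample_kmers & reference_unique_kmers
    return 100.0 * len(found) / len(reference_unique_kmers)


sample = {"AAAA", "CCCC", "GGGG"}
reference_unique = {"AAAA", "CCCC", "TTTT", "ACGT"}
print(unique_matches_percent(sample, reference_unique))  # 50.0
```

Note that the denominator shrinks as similar genomes are added to a database, because fewer kmers remain unique to any one of them.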
Total matches: the shared plus unique matches found, divided by the pre-calculated number of shared plus unique matches possible in the reference database.
What is percent total coverage and when would I use it?
Total matches is the number of shared plus unique kmers found in your sample for an organism, out of the total possible number of shared plus unique kmers in our database that have been calculated for that reference organism. Shared kmers are those also found in other, similar organisms further up the phylogenetic tree.
A common use of total matches is for comparing antibiotic resistance genes or virulence factors. When you do a comparative analysis for these databases you will notice that this is the default value used for the comparison. Since these databases are gene-based rather than organism-based, looking at the number of total kmers identified out of those possible is meaningful in understanding how much of the gene has been represented in the sample.
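To make the total matches ratio concrete, here is a minimal sketch that treats the reference's shared and unique kmers as Python sets. All kmers and names below are invented for illustration, and the real pre-calculated totals in the database may be derived differently.

```python
from typing import Set


def total_matches_percent(sample_kmers: Set[str],
                          reference_unique: Set[str],
                          reference_shared: Set[str]) -> float:
    """Percent of a reference's shared plus unique kmers found in the sample."""
    possible = reference_unique | reference_shared
    if not possible:
        return 0.0
    return 100.0 * len(sample_kmers & possible) / len(possible)


reference_unique = {"AAAA", "CCCC", "GGGG", "TTTT"}
reference_shared = {"ACGT", "TGCA", "GATC", "CTAG"}
sample = {"AAAA", "ACGT", "TGCA", "GATC", "CTAG"}

# 5 of the 8 possible kmers are present, although only 1 of 4 unique kmers is.
print(total_matches_percent(sample, reference_unique, reference_shared))  # 62.5
```

A pattern like this one, with low unique coverage but higher total coverage, is the situation described earlier for a near taxonomic neighbor of a reference strain.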
Relative abundance is calculated based on the number of organism specific kmers and their observed frequency in the sample and then normalized to represent the abundance of each organism.
Each account is pre-loaded with example datasets. They can be used to see how the app works and for comparative analysis. Here's a brief video that shows how to view example datasets:
Description of datasets
These samples are from the Human Microbiome Project (HMP); see the Human Microbiome Project website.
The names correspond to the body sites they were isolated from:
attk_ging.fasta: attached keratinized gingiva (oral)
bucmuc.fasta: buccal mucosa (oral)
hardpal.fasta: hard palate (oral)
leftret_crease.fasta: left retroauricular crease (skin)
midvag.fasta: mid vagina (urogenital tract)
palton.fasta: palatine tonsils (oral)
postfor.fasta: posterior fornix (urogenital tract)
rightret_crease.fasta: right retroauricular crease (skin)
saliva*.fasta: saliva (oral)
stool.fasta: stool (gastrointestinal tract)
subplaq.fasta: subgingival plaque (oral)
supplaq.fasta: supragingival plaque (oral)
tongue_.fasta: tongue (oral)
vagintr*fasta: vaginal introitus (urogenital tract)
The samples can be used with other samples for comparative analysis, or just to get an understanding of how the CosmosID app works.