FAQ
Sequencing
Do you offer metagenomics and metatranscriptomics sequencing services?
Yes, we offer a full range of services in our CLIA-certified lab, including DNA extraction, library preparation, and sequencing. You can find more information at cosmosid.com.
How much does it cost?
For DNA sequencing and other services, please contact us at [email protected] as pricing is dependent on sequencing depth, number of samples, etc.
Metagenomic Analysis
How should I name my paired-end files?
Paired-end file names must contain "_R1" or "_R2", optionally followed by a chunk number such as "_001" or "_002", and must end with the sequence file extension (.fastq, .fasta, etc.).
Examples of acceptable paired-end names:
Seqrun1_L001_R1.fastq.gz
Seqrun1_L001_R2.fastq.gz
Seqrun1_R1_001.fastq.gz
Seqrun1_R2_001.fastq.gz
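As a rough illustration of the naming rule, the Python sketch below checks a file name and works out which mate it is. The regular expression and the function name split_pair_name are assumptions made for this example, inferred from the acceptable names listed above. They are not the validation that CosmosID-HUB itself performs.

```python
import re

# Assumed pattern (hypothetical, inferred from the example names):
# "_R1" or "_R2", an optional three-digit chunk such as "_001",
# then a sequence extension with an optional ".gz".
PAIRED_END = re.compile(
    r"^(?P<prefix>.+)_R(?P<mate>[12])(?P<chunk>_\d{3})?"
    r"\.(?P<ext>fastq|fq|fasta|fna)(?:\.gz)?$"
)

def split_pair_name(filename):
    """Return (sample_key, mate) for a paired-end name, or None if it does not match."""
    m = PAIRED_END.match(filename)
    if m is None:
        return None
    # Files that share the same key are treated as the two mates of one sample.
    key = m.group("prefix") + (m.group("chunk") or "")
    return key, int(m.group("mate"))

for name in ["Seqrun1_L001_R1.fastq.gz", "Seqrun1_R2_001.fastq.gz", "Seqrun1.fastq.gz"]:
    print(name, "->", split_pair_name(name))
```

Two files that return the same key with mates 1 and 2 would form a pair. A name that returns None, such as the last one, lacks the "_R1"/"_R2" tag.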
Can I upload bam files? What about .gz files?
Yes, we accept BAM files and gzipped files. The full list of accepted file extensions is:
.fasta, .fna, .fasta.gz, .fastq, .fq, .fastq.gz, .bam, .bam.gz, .sra, .sra.gz
If I have a lot of host DNA in my samples will you be able to identify microbes?
Yes! Our algorithms evaluate only the microbial DNA in your samples and ignore host DNA, based on the host you select during the upload step.
What’s the difference between 16S and WGS (shotgun) metagenomics?
Shotgun metagenomics: using whole genome shotgun (WGS) sequencing, the CosmosID algorithms identify microorganisms based on their entire genomes.
Amplicon/16S/ITS: unlike shotgun metagenomics, amplicon (16S/ITS) analysis identifies organisms using only the relevant ribosomal RNA gene or genes, not the entire genome.
There are advantages and disadvantages to each method.
Shotgun metagenomics provides more sensitivity, higher resolution, and more detailed information, with sub-species or strain-level identification. With shotgun metagenomics we can provide results from multiple kingdoms (bacteria, viruses, protists, and fungi), and we can identify antibiotic resistance genes and virulence factors. The main disadvantage is that WGS sequencing is usually more expensive per sample than amplicon sequencing.
Amplicon/16S/ITS has limitations in its ability to identify organisms to taxonomic levels lower than species; it often can only classify organisms at the genus level or higher. The sequencing for amplicon analysis is generally less expensive than shotgun metagenomic sequencing.
CosmosID-HUB Usage
When would I use frequency?
Frequency is useful when you want to see the raw number of kmer hits to unique regions of a genome rather than the normalized representation you see with relative abundance. For example, your sample may have a low relative abundance of Salmonella enterica, such as 0.1%, and you want to know exactly how many unique kmers match that genome. Frequency tells you that. There are two other important things to understand:
- Frequency counts only unique kmers. Matching kmers that are shared with other genomes are not reported here.
- The kmers counted for frequency can be redundant. They can come from multiple reads that cover the exact same region of the genome, or from repetitive regions (i.e., identical kmers could have originated from multiple locations in the genome).
You can also use frequency as a parameter for comparative analysis. If you do, your stacked bar graph will show absolute numbers (with the ability to switch to relative numbers), and your other comparative analyses will differ slightly because of the different input.
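To make the two caveats concrete, here is a minimal Python sketch of a frequency-style count. It is a toy under stated assumptions, not the CosmosID algorithm: the function names, the 4-base kmer length, and the example sequences are all hypothetical. It counts every occurrence of a genome-unique kmer in the reads, so redundant hits from overlapping reads are included.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def frequency(reads, unique_kmers, k):
    """Count all read k-mers that hit a genome-unique k-mer.

    Shared k-mers are simply absent from `unique_kmers`, so they are ignored.
    Repeated hits to the same k-mer are counted each time (redundancy).
    """
    hits = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in unique_kmers:
                hits[km] += 1
    return sum(hits.values()), hits

# Hypothetical data: two identical reads cover the same region of the genome.
unique_to_genome = {"ACGT", "CGTA"}
reads = ["ACGTA", "ACGTA", "TTTTT"]
total_hits, per_kmer = frequency(reads, unique_to_genome, k=4)
print(total_hits, len(per_kmer))
```

In this toy input the raw count is 4 even though only 2 distinct unique kmers were observed, which is the redundancy described above.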
What is unique matches?
UNIQUE matches is the number of unique kmers found in your sample for an organism out of the total possible number of kmers in our database that are unique to that organism. If unique matches are very low for an organism, it could mean a couple of things:
- For strains, this could indicate that the exact strain in our reference database is not a perfect match to the strain in the sample. However, if you also see high TOTAL matches for this same strain, this is a good indication that a near taxonomic neighbor of the reference strain is in your sample.
- For some organisms that have high representation in sequencing space (E. coli, for example), there may not be many unique areas in the genome available for strain level identification. As more and more similar genomes are added to the database, we lose the ability to discriminate between them as they are so highly similar to each other. This will cause the unique matches to be low.
What is total matches and when would I use it?
TOTAL matches is the total number of shared + unique kmers found in your sample for an organism out of the total possible number of shared + unique kmers in our database that have been calculated for that reference organism. Shared kmers are those also found in other, similar organisms at higher levels of the phylogenetic tree.
A common use of total matches, expressed as a percentage, is comparing antibiotic resistance genes or virulence factors. When you do a comparative analysis for these databases, you will notice that this is the default value used for the comparison. Since these databases are gene-based rather than organism-based, the number of total kmers identified out of those possible is a meaningful indication of how much of the gene is represented in the sample.
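The relationship between unique and total matches can be sketched in a few lines of Python. This is an illustration only: the function name percent_matches and the tiny kmer sets are hypothetical, and the real database kmers are far longer and more numerous.

```python
def percent_matches(sample_kmers, reference_kmers):
    """Percentage of the reference k-mers that were observed in the sample."""
    if not reference_kmers:
        return 0.0
    found = len(sample_kmers & reference_kmers)
    return 100.0 * found / len(reference_kmers)

# Hypothetical reference strain: 4 k-mers unique to it, 2 shared with relatives.
unique_ref = {"AAAC", "AACG", "ACGT", "CGTT"}
shared_ref = {"GGGG", "GGGC"}

# Hypothetical sample containing a near neighbor of the reference strain.
sample = {"AAAC", "GGGG", "GGGC", "TTTT"}

unique_pct = percent_matches(sample, unique_ref)
total_pct = percent_matches(sample, unique_ref | shared_ref)
print(unique_pct, total_pct)
```

Here unique matches are 25% while total matches are 50%, the pattern described above for a sample that contains a close relative of the reference strain rather than the strain itself.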
How do you calculate relative abundance?
Relative abundance is calculated from the reference genome size, the number of organism-specific kmers, and their observed frequency in the sample. The result is then normalized so that the abundance of each organism is shown as a percentage.
Can you identify novel strains?
If you have genomes in your sample that are considered novel – where they have never been sequenced before – we will identify them to the closest taxonomic level possible. The ability to discriminate at a low taxonomic level depends on how different the organism is from its nearest neighbors that have been sequenced.
How secure is my data?
Your data is very secure. CosmosID follows industry-standard security best practices for account management, access, and encryption. We have invested in Extended Validation (EV) SSL certificates that use SHA-2. EV certificates are designed to strengthen security and combat phishing attacks.
All communications between our servers and your machine are served exclusively over HTTPS. The data is hosted on Amazon Web Services (AWS), and we apply Identity and Access Management (IAM) controls on S3, which are critical for preventing information leakage.
My sample is really large and it won’t upload, what do I do?
If your samples are failing to load, please contact us at [email protected]. We will be glad to help you out.
Is there a maximum file size for upload and analysis?
We support file sizes up to 50 GB for shotgun data (gzipped or unzipped) and up to 1 GB for ITS/16S data (250 MB for .gz files). If your data is larger than our file size limit, we recommend using the reformat.sh script from the BBTools package (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/reformat-guide/) to downsample the data to a lower read depth before uploading. If you need our assistance with downsampling, please contact [email protected].
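As a sketch of how the downsampling step might be prepared, the Python helper below estimates a sampling rate from the file size and builds a reformat.sh command line. The helper name downsample_command, the 0.9 safety margin, and the file names are assumptions for illustration. The in=, in2=, out=, out2=, and samplerate= parameters are the ones described in the BBTools reformat guide, but please check that guide for your installed version. File size is only a rough proxy for read count, so treat the rate as a starting point.

```python
def downsample_command(in1, in2, out1, out2, file_size_gb, limit_gb=50.0, margin=0.9):
    """Build a reformat.sh argument list, or return None if no downsampling is needed.

    file_size_gb is the size of the larger of the two paired files.
    margin (an assumed safety factor) keeps the result a little under the limit.
    """
    if file_size_gb <= limit_gb:
        return None
    rate = round(margin * limit_gb / file_size_gb, 3)
    return [
        "reformat.sh",
        f"in={in1}", f"in2={in2}",      # paired input files
        f"out={out1}", f"out2={out2}",  # paired output files
        f"samplerate={rate}",           # fraction of reads to keep
    ]

# Hypothetical 90 GB paired-end files.
cmd = downsample_command(
    "Seqrun1_R1_001.fastq.gz", "Seqrun1_R2_001.fastq.gz",
    "Seqrun1sub_R1_001.fastq.gz", "Seqrun1sub_R2_001.fastq.gz",
    file_size_gb=90.0,
)
print(" ".join(cmd))
```

The printed line can be pasted into a shell where BBTools is installed. Passing both files in one call keeps the read pairs together.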
What is the difference between alpha diversity and beta diversity?
Alpha diversity describes how many different taxa are detected within each sample. Beta diversity describes the difference in microbial composition between samples. In short, alpha diversity looks at one sample and asks "how many?", while beta diversity compares samples and asks "how different?".
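The distinction can be illustrated with two simple measures in Python. Richness (a count of detected taxa) stands in for alpha diversity and Bray-Curtis dissimilarity stands in for beta diversity. These are common textbook choices picked for this sketch, and the sample counts are made up; the metrics offered in CosmosID-HUB may differ.

```python
def richness(counts):
    """Alpha diversity as the number of taxa detected in one sample."""
    return sum(1 for n in counts.values() if n > 0)

def bray_curtis(a, b):
    """Beta diversity between two samples: 0 = identical, 1 = no taxa in common."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total if total else 0.0

# Hypothetical abundance counts for two samples.
sample_a = {"Escherichia coli": 50, "Bacteroides fragilis": 50}
sample_b = {"Escherichia coli": 50, "Salmonella enterica": 50}

print(richness(sample_a), richness(sample_b))
print(bray_curtis(sample_a, sample_b))
```

Both samples have the same alpha diversity (2 taxa each), yet their beta diversity is 0.5 because they share only half of their composition.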
What does the presence of "u_s" and "u_t" in a taxa name refer to in the taxonomy results view?
Reads can be assigned to taxa that may not make it down to species or strain level. When we do not observe a strong enough signal to an individual species/strain, a "Branch" call at a higher phylogenetic level will instead be made. These would be calls such as "Escherichia" or "Enterobacteriaceae" if you were looking at an individual sample result on the app with the "strain-level statistics" view. These are considered in aggregation values. If aggregating upwards (towards phylum) they will be included. If aggregating downward (towards strain) they will be named as "Escherichia_u_s" (at species level) or "Escherichia_u_t" (at strain level) in the taxonomy switcher view.
Why do our Taxa, AMR, and Virulence workflows require neither quality trimming nor adapter removal of sequencing reads?
Our algorithm looks for exact matches between kmers in your reads and our database of taxonomic kmer biomarkers. If a query kmer contains any ambiguous or wrong bases, it will not exactly match any of our taxonomically relevant biomarkers, so it is automatically discarded and not considered in the analysis.
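A toy Python example shows why exact matching makes trimming unnecessary. The biomarker set, the 5-base kmer length, and the adapter-like tail are hypothetical, and this is not the production algorithm. Kmers that contain an ambiguous base or adapter sequence simply fail the lookup.

```python
def matching_kmers(read, biomarkers, k):
    """Return the k-mers of a read that exactly match a biomarker k-mer."""
    hits = []
    for i in range(len(read) - k + 1):
        km = read[i:i + k]
        if km in biomarkers:  # exact match only; anything else is dropped
            hits.append(km)
    return hits

# Hypothetical biomarker database and reads.
biomarkers = {"ACGTA", "CGTAC"}
clean_read = "ACGTAC"
read_with_n = "ACNTAC"                 # one ambiguous base
read_with_adapter = "ACGTAC" + "AGATCG"  # made-up adapter-like tail

print(matching_kmers(clean_read, biomarkers, 5))
print(matching_kmers(read_with_n, biomarkers, 5))
print(matching_kmers(read_with_adapter, biomarkers, 5))
```

The read with the ambiguous base contributes nothing, and the read with the untrimmed tail contributes the same two hits as the clean read, because no kmer that overlaps the tail is in the biomarker set.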
Why is rarefaction analysis not included in CosmosID-HUB?
Performing rarefaction analysis on shotgun metagenomics data has several limitations:
- Loss of information: Rarefaction discards a significant amount of data by subsampling, which could lead to the loss of rare or low-abundance taxa. This may skew the diversity estimates and might not accurately represent the true microbial diversity in the samples.
- Sensitivity to sequencing depth: The results of rarefaction analysis can be highly sensitive to the chosen sequencing depth. Different depths can lead to different diversity estimates, making it difficult to determine the optimal depth for meaningful comparisons.
- Inefficiency: Rarefaction analysis can be computationally intensive, especially for large datasets with millions of reads or sequences. This can make the process time-consuming and resource-demanding.
- Biases in the estimation of diversity: The randomness of subsampling can introduce biases in the estimation of diversity indices, leading to inaccurate or misleading conclusions about the community structure and composition.
- Limited applicability to shotgun metagenomics data: Rarefaction analysis is primarily designed for amplicon sequencing data (e.g., 16S rRNA gene sequencing) and may not be as applicable or informative for shotgun metagenomics data. Shotgun metagenomics also provides information about the functional potential of the microbial community, which cannot be adequately captured by rarefaction analysis.
- Inability to account for differences in genome size: In shotgun metagenomics data, the genome size of different organisms can impact the abundance of taxa as well as individual genes. Rarefaction analysis does not account for these differences, which could lead to biased estimates of both taxa and gene abundance and diversity.
- Inability to capture functional redundancy: Rarefaction analysis is not well-suited to capture functional redundancy in shotgun metagenomics data, where multiple organisms may have similar genes or pathways. This limitation can lead to underestimation of the functional diversity and resilience of microbial communities.
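The first limitation in the list above, loss of information, can be seen in a small Python simulation. The counts, depth, and function name rarefy are made up for illustration. The sketch subsamples reads without replacement to a fixed depth, which is the core of rarefaction.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a count table to `depth` reads without replacement."""
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rarefied = {taxon: 0 for taxon in counts}
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied

# Hypothetical sample: taxon C is rare (10 of 10,000 reads).
counts = {"A": 9000, "B": 990, "C": 10}
rarefied = rarefy(counts, depth=100)
print(rarefied)
```

At a depth of 100 reads, the rare taxon is expected to appear about 0.1 times on average, so in most subsamples it disappears entirely. Changing the depth or the seed changes which taxa survive, which also illustrates the sensitivity to sequencing depth.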
How can you open TSV files in Excel?
On both Windows and Mac, right-click the file and then click "Open with". On Mac, Excel automatically appears in the list of recommended apps. On Windows, choose the option to select more apps and then choose Excel.
I have samples I ran previously on the app and I would like to rerun them on the new databases, how do I do that?
Please contact us at [email protected] for more information.
I have new samples to run, will I automatically get the new databases?
Yes, when you upload new samples they will be analyzed using the new databases unless requested otherwise.
Will my results change with the new databases?
If your samples are rerun on the new databases, it is possible that new species or strains will be identified in your samples that were not previously in the database and this may shift the abundance estimates for your previous results.