Methods
How to Cite CosmosID
Reference for publications:
CosmosID Metagenomics Cloud, app.cosmosid.com, CosmosID Inc., www.cosmosid.com
Taxonomic Classification Methods:
The system uses a high-performance, data-mining k-mer algorithm that rapidly disambiguates millions of short sequence reads into the discrete genomes from which those sequences originate. The pipeline has two separable phases: a pre-computation phase for the reference databases and a per-sample computation phase. The input to the pre-computation phase is a set of databases of reference genomes, virulence markers, and antimicrobial resistance markers that are continuously curated by CosmosID scientists. The output of the pre-computation phase is a phylogenetic tree of microbes, together with sets of variable-length k-mer fingerprints (biomarkers) that are uniquely associated with distinct branches and leaves of the tree. The per-sample phase searches the hundreds of millions of short sequence reads, or alternatively contigs from draft de novo assemblies, against the fingerprint sets. This query enables sensitive yet highly precise detection and taxonomic classification of microbial NGS reads. The resulting statistics are analyzed to return fine-grained taxonomic and relative abundance estimates for the microbial NGS datasets. To exclude false-positive identifications, the results are filtered using a threshold derived from internal statistical scores, which are determined by analyzing a large number of diverse metagenomes. The same approach is applied to enable sensitive and accurate detection of genetic markers for virulence and for antibiotic resistance.
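The two-phase idea described above can be illustrated with a toy sketch. This is not the CosmosID implementation: it assumes a fixed k-mer length instead of variable-length fingerprints, a flat set of references instead of a phylogenetic tree, and a simple hit-count cutoff (`min_hits`, a hypothetical parameter) in place of the internal statistical scores. All function names are illustrative.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_fingerprints(references, k):
    """Pre-computation phase (toy version): for each reference, keep
    only the k-mers that occur in no other reference."""
    all_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    # Count how many references contain each k-mer.
    owner_count = Counter(km for kms in all_kmers.values() for km in kms)
    return {
        name: {km for km in kms if owner_count[km] == 1}
        for name, kms in all_kmers.items()
    }


def classify_reads(reads, fingerprints, k, min_hits=2):
    """Per-sample phase (toy version): count fingerprint k-mers seen in
    the reads, then drop references below the assumed min_hits cutoff."""
    hits = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        for name, fingerprint in fingerprints.items():
            hits[name] += len(read_kmers & fingerprint)
    return {name: n for name, n in hits.items() if n >= min_hits}


# Hypothetical toy data, for illustration only.
references = {"genome_A": "ACGTACGTTT", "genome_B": "ACGTGGGGCC"}
fingerprints = build_fingerprints(references, k=4)
detected = classify_reads(["GTACGTT", "TGGGG"], fingerprints, k=4)
```

Because the shared k-mer `ACGT` occurs in both toy references, it is excluded from both fingerprint sets and cannot contribute to either count, which is the property that makes unique fingerprints useful for disambiguation.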
Functional Classification Methods:
Initial QC, adapter trimming, and preprocessing of metagenomic sequencing reads are done using BBDuk (1). The quality-controlled reads are then subjected to a translated search against a comprehensive, non-redundant protein sequence database, UniRef90. The UniRef90 database, provided by UniProt (2), represents a clustering of all non-redundant protein sequences in UniProt, such that each sequence in a cluster aligns with at least 90% identity and 80% coverage to the longest sequence in the cluster. The mapping of metagenomic reads to gene sequences is weighted by mapping quality, coverage, and gene sequence length to estimate community-wide weighted gene family abundances, as described by Franzosa et al. (3). Gene families are then annotated to MetaCyc (4) reactions (metabolic enzymes) to reconstruct and quantify MetaCyc metabolic pathways in the community, as described by Franzosa et al. (3). Furthermore, the UniRef90 gene families are regrouped to GO terms (5) to give an overview of GO functions in the community. Lastly, to facilitate comparisons across multiple samples with different sequencing depths, the abundance values are normalized using total-sum scaling (TSS) to produce "copies per million" units (analogous to TPM in RNA-Seq).
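The final normalization step can be shown with a short worked example. The sketch below assumes that total-sum scaling divides each gene family abundance by the sample total and multiplies by one million; the function name and the sample values are hypothetical, and the platform's exact handling of unmapped or ungrouped reads is not covered here.

```python
def copies_per_million(abundances):
    """Total-sum scaling: rescale one sample's abundances so they sum
    to one million (assumed definition of "copies per million")."""
    total = sum(abundances.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero total abundance")
    return {
        feature: value * 1_000_000 / total
        for feature, value in abundances.items()
    }


# Hypothetical gene family abundances for a single sample.
sample = {"family_1": 30.0, "family_2": 10.0, "family_3": 60.0}
normalized = copies_per_million(sample)
```

After scaling, every sample sums to the same total, so a feature's value reflects its share of the sample and not the sequencing depth.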
References:
1. Bushnell, B. (2021). BBDuk Guide - DOE Joint Genome Institute. Retrieved 1 August 2021, from https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
2. UniProt: the universal protein knowledgebase. (2016). Nucleic Acids Research, 45(D1), D158-D169. doi: 10.1093/nar/gkw1099
3. Franzosa, E., McIver, L., Rahnavard, G., Thompson, L., Schirmer, M., & Weingart, G. et al. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11), 962-968. doi: 10.1038/s41592-018-0176-y
4. Caspi, R., Foerster, H., Fulcher, C., Kaipa, P., Krummenacker, M., & Latendresse, M. et al. (2007). The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research, 36(Database), D623-D631. doi: 10.1093/nar/gkm900
5. Carbon, S., Ireland, A., Mungall, C., Shu, S., Marshall, B., & Lewis, S. (2008). AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2), 288-289. doi: 10.1093/bioinformatics/btn615
16S Taxonomic Classification Methods:
For taxonomic profiling based on amplicon data, the CosmosID 16S data analysis pipeline starts by preprocessing the raw reads from either paired-end or single-end FASTQ files, trimming them to remove adapters as well as low-quality reads and bases. If the reads are paired-end, the overlapping forward and reverse pairs are joined together; the unjoined R1 and R2 reads are then appended to the end of the file. The file is then converted to FASTA format and used as input for OTU picking. OTUs are identified against the CosmosID curated 16S database using a closed-reference OTU picker at 97% sequence similarity through the QIIME framework. The final results are presented in tabular format with the taxonomic names, OTU IDs, frequency, and relative abundance. Results can be downloaded or compared with other 16S samples for visualization through the CosmosID Comparative Analysis tool.
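As an illustration of the tabular output described above, the sketch below builds a frequency and relative abundance table from per-read OTU assignments. It is a simplified stand-in, not the pipeline's code: the assignments are taken as given (no sequence matching is performed), reads without a reference hit are assumed to be excluded from the totals, and the OTU IDs, taxon names, and function name are hypothetical.

```python
from collections import Counter


def otu_table(assignments, taxonomy):
    """Summarize per-read OTU assignments as rows of taxon, OTU ID,
    frequency, and relative abundance. Reads assigned to None (no
    reference hit) are assumed to be dropped."""
    counts = Counter(otu for otu in assignments if otu is not None)
    total = sum(counts.values())
    rows = []
    # most_common() orders OTUs from highest to lowest frequency.
    for otu_id, frequency in counts.most_common():
        rows.append({
            "taxon": taxonomy.get(otu_id, "Unclassified"),
            "otu_id": otu_id,
            "frequency": frequency,
            "relative_abundance": frequency / total,
        })
    return rows


# Hypothetical assignments for five reads; one read had no reference hit.
taxonomy = {"OTU_1": "Bacteroides", "OTU_2": "Lactobacillus"}
table = otu_table(["OTU_1", "OTU_2", "OTU_1", None, "OTU_1"], taxonomy)
```

Relative abundance here is each OTU's read count divided by the total number of assigned reads, so the column sums to 1 for each sample.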