Guides

Methods for Manuscripts

How to Cite CosmosID-HUB

Reference for publications:

CosmosID-HUB, www.cosmosidhub.com


CHAMP Human Taxonomic Profiling Methods

Taxonomic profiling of the human microbiome samples was performed using the Clinical Microbiomics Human Microbiome Profiler (CHAMP) version 1.01 [1] within the CosmosID-HUB (www.cosmosidhub.com). CHAMP combines a marker gene-based approach with a metagenome-assembled genome database to profile prokaryotes, eukaryotes, and viruses from short-read sequencing data. CHAMP uses an internal reference database of 6,567 prokaryotic species (6,546 bacterial and 21 archaeal, GTDB-based taxonomy) and 244 eukaryotic species. The database was derived from 30,382 human microbiome samples collected across nine distinct human body sites: gut, small intestinal biopsies, oral, skin, urine, nasopharyngeal, vaginal, airway, and milk samples. Raw sequencing reads were preprocessed using AdapterRemoval (v. 2.3.1) [2] to remove adapters and low-quality bases (Phred score < 30). Host contamination was removed by discarding read pairs mapping to the human reference genome GRCh38 using Bowtie2 v. 2.4.2 [3]. High-quality non-host reads with a minimum length of 100 bp were retained and mapped to the CHAMP database using BWA mem v. 0.7.17 [4]. Species abundances were calculated using a negative binomial distribution model for signature gene read counts, with normalization for effective gene length. Species with insufficient signature gene coverage or read counts were filtered out. Detection criteria included a minimum of 5 aligned paired-end reads and either ≥99% nucleotide identity with ≥5% genome coverage, or ≥95% nucleotide identity with ≥30% genome coverage. CHAMP provided taxonomic profiles with the relative abundance of each species detected in a sample, normalized to sum to 100% per sample.
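
For readers who want to see how the detection criteria and the per-sample normalization fit together, the following is a minimal Python sketch. It is not CHAMP's implementation: the `SpeciesHit` structure, its field names, and the example values are assumptions made for illustration, and the negative binomial abundance model is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SpeciesHit:
    """Per-species alignment summary (hypothetical structure for illustration)."""
    name: str
    read_pairs: int    # aligned paired-end reads
    identity: float    # nucleotide identity, in percent
    coverage: float    # genome coverage, in percent
    abundance: float   # un-normalized abundance estimate


def is_detected(hit: SpeciesHit) -> bool:
    """Apply the detection criteria quoted in the methods text."""
    if hit.read_pairs < 5:
        return False
    strict = hit.identity >= 99.0 and hit.coverage >= 5.0
    relaxed = hit.identity >= 95.0 and hit.coverage >= 30.0
    return strict or relaxed


def relative_abundances(hits: list[SpeciesHit]) -> dict[str, float]:
    """Keep detected species and rescale their abundances to sum to 100%."""
    kept = [h for h in hits if is_detected(h)]
    total = sum(h.abundance for h in kept)
    if total == 0:
        return {}
    return {h.name: 100.0 * h.abundance / total for h in kept}


# Made-up example values, not real data.
hits = [
    SpeciesHit("species_A", 120, 99.4, 12.0, 30.0),
    SpeciesHit("species_B", 40, 96.1, 45.0, 10.0),
    SpeciesHit("species_C", 3, 99.9, 50.0, 5.0),   # fewer than 5 read pairs
    SpeciesHit("species_D", 60, 96.0, 10.0, 8.0),  # fails both identity/coverage rules
]
profile = relative_abundances(hits)
print(profile)  # {'species_A': 75.0, 'species_B': 25.0}
```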

CHAMP also incorporates functional annotation and profiling to provide insights into the metabolic potential of the microbiome. EggNOG-mapper (v. 2.1.7, Diamond mode) [5] was used to map prokaryotic genes in the gene catalog to the EggNOG orthologous groups database (v. 5.0) [6] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) database [7]. For eukaryotic genes, KofamScan was employed for annotation [8]. CHAMP also uses several types of functional modules to describe the microbiome's metabolic potential. Functional potential profiles were generated from the species profiles and reported as cellular abundance, which attributes a function to a species when the species includes at least 2/3 of the genes needed to complete the module's functionality. KEGG modules (v. 78.2) [7] are defined as sets of KOs that enable specific functions or pathways, and KO abundances were calculated as the proportion of the total gene abundance that mapped to a given KO. Gut-Brain Modules (GBMs), a set of 56 microbial pathways for metabolizing neuroactive compounds, were also incorporated. Each GBM corresponds to a single neuroactive compound synthesis or degradation process. Additionally, Gut Metabolic Modules (GMMs), a set of 103 conserved human gut metabolic pathways, were included. For both GBMs and GMMs, a species was considered to contain a given module if it included genes annotated to at least 2/3 of the steps needed to complete the module's functionality.
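
The 2/3 completeness rule can be illustrated with a short Python sketch. This is an illustration under stated assumptions rather than CHAMP's code: the function names, the step and KO identifiers, and the reading of cellular abundance as the summed abundance of species that carry the module are all assumptions.

```python
from __future__ import annotations


def has_module(species_kos: set[str], module_steps: list[set[str]]) -> bool:
    """True if the species covers at least 2/3 of the module's steps.

    Each step is a set of alternative KOs; a step counts as covered when the
    species carries any one of them.
    """
    if not module_steps:
        return False
    covered = sum(1 for step in module_steps if step & species_kos)
    # Integer form of covered / total >= 2/3, avoiding float rounding.
    return 3 * covered >= 2 * len(module_steps)


def module_cellular_abundance(
    profile: dict[str, float],
    annotations: dict[str, set[str]],
    module_steps: list[set[str]],
) -> float:
    """Sum the abundance of every species that carries the module (assumed definition)."""
    return sum(
        abundance
        for species, abundance in profile.items()
        if has_module(annotations.get(species, set()), module_steps)
    )


# Hypothetical three-step module; the identifiers are placeholders, not real KOs.
module_steps = [{"K_step1"}, {"K_step2a", "K_step2b"}, {"K_step3"}]
annotations = {
    "species_A": {"K_step1", "K_step2b"},  # covers 2 of 3 steps
    "species_B": {"K_step3"},              # covers 1 of 3 steps
}
profile = {"species_A": 60.0, "species_B": 40.0}
print(module_cellular_abundance(profile, annotations, module_steps))  # 60.0
```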


References:

[1] Pita S, Myers PN, Johansen J, Russel J, Nielsen MC, Eklund AC and Nielsen HB (2024) CHAMP delivers accurate taxonomic profiles of the prokaryotes, eukaryotes, and bacteriophages in the human microbiome. Front. Microbiol. 15:1425489. doi: 10.3389/fmicb.2024.1425489

[2] Schubert, M., Lindgreen, S., & Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9, 88. https://doi.org/10.1186/s13104-016-1900-2

[3] Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923

[4] Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

[5] Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P., & Huerta-Cepas, J. (2021). eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular Biology and Evolution, 38(12), 5825–5829. https://doi.org/10.1093/molbev/msab293

[6] Huerta-Cepas, J., Szklarczyk, D., Heller, D., Hernández-Plaza, A., Forslund, S. K., Cook, H., Mende, D. R., Letunic, I., Rattei, T., Jensen, L. J., von Mering, C., & Bork, P. (2019). eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1), D309–D314. https://doi.org/10.1093/nar/gky1085

[7] Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M., & Ishiguro-Watanabe, M. (2023). KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Research, 51(D1), D587–D592. https://doi.org/10.1093/nar/gkac963

[8] Aramaki, T., Blanc-Mathieu, R., Endo, H., Ohkubo, K., Kanehisa, M., Goto, S., & Ogata, H. (2020). KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics, 36(7), 2251–2252. https://doi.org/10.1093/bioinformatics/btz859


KEPLER Host-Agnostic Taxonomic Profiling Methods:

Kepler v1.1.0 (accessed via CosmosID-HUB, www.cosmosidhub.com) uses a high-performance, data-mining k-mer algorithm that rapidly disambiguates millions of short sequence reads into the discrete genomes from which the sequences originate. The pipeline has two separable phases: a pre-computation phase for reference databases and a per-sample computation phase. The input to the pre-computation phase is a database of >150,000 GTDB-based reference genomes and genes that are continuously curated by CosmosID scientists. The output of the pre-computation phase is a phylogenetic tree of microbes, together with sets of variable-length k-mer fingerprints (biomarkers) uniquely associated with distinct branches and leaves of the tree. The per-sample computation phase searches the hundreds of millions of short sequence reads, or alternatively contigs from draft de novo assemblies, against the fingerprint sets. This query enables sensitive yet highly precise detection and taxonomic classification of microbial NGS reads. The resulting statistics are analyzed to return strain-resolution taxonomic and relative abundance estimates for the microbial NGS datasets.
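
The two-phase idea described above can be sketched in a few lines of Python. This is a toy illustration only, not Kepler's algorithm: it uses fixed-length k-mers and a flat set of references rather than variable-length fingerprints attached to a phylogenetic tree, and the function names and sequences are made up.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_fingerprints(references: dict[str, str], k: int) -> dict[str, set[str]]:
    """Pre-computation phase: keep only k-mers found in exactly one reference."""
    owners: dict[str, set[str]] = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    fingerprints: dict[str, set[str]] = {name: set() for name in references}
    for kmer, names in owners.items():
        if len(names) == 1:
            fingerprints[next(iter(names))].add(kmer)
    return fingerprints


def classify_reads(reads: list[str], fingerprints: dict[str, set[str]], k: int) -> Counter:
    """Per-sample phase: count reads whose unique k-mers point to a single reference."""
    kmer_to_ref = {kmer: name for name, fps in fingerprints.items() for kmer in fps}
    counts: Counter = Counter()
    for read in reads:
        hits = {kmer_to_ref[kmer] for kmer in kmers(read, k) if kmer in kmer_to_ref}
        if len(hits) == 1:  # unambiguous assignment only
            counts[hits.pop()] += 1
    return counts


# Made-up toy sequences sharing the prefix ACGTACGT.
references = {"genome_X": "ACGTACGTTTGACC", "genome_Y": "ACGTACGTGGCATA"}
fingerprints = build_fingerprints(references, k=6)
reads = ["CGTTTGAC", "GTGGCATA", "ACGTACGT", "TTTGACC"]
counts = classify_reads(reads, fingerprints, k=6)
print(dict(counts))  # {'genome_X': 2, 'genome_Y': 1}
```

The read `ACGTACGT` is left unclassified because every k-mer it contains is shared by both references, which is the reason the sketch discards shared k-mers during pre-computation.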

_If filtered dataset is used:_ To exclude false positive identifications, the results are filtered using a threshold derived from internal statistical scores that are determined by analyzing a large number of diverse metagenomes. The same approach is applied to enable the sensitive and accurate detection of genetic markers for virulence and for resistance to antibiotics.

Example Citations

The papers listed below have cited the CosmosID-HUB when referencing our taxonomic profiling method.

  1. https://www.science.org/doi/10.1126/science.adj3502

  2. https://www.frontiersin.org/journals/medicine/articles/10.3389/fmed.2023.1165980/full

  3. https://www.cell.com/cell-host-microbe/pdfExtended/S1931-3128(19)30158-1

  4. https://www.nature.com/articles/s41598-022-14118-9#Sec2

  5. https://journals.asm.org/doi/full/10.1128/mbio.00591-22


FUNCTIONAL Host-Agnostic Profiling Methods:

Initial QC, adapter trimming, and preprocessing of metagenomic sequencing reads are done using BBDuk (1). The quality-controlled reads are then subjected to a translated search against a comprehensive, non-redundant protein sequence database, UniRef90. The UniRef90 database, provided by UniProt (2), is a clustering of all non-redundant protein sequences in UniProt such that each sequence in a cluster aligns with at least 90% identity and 80% coverage to the longest sequence in the cluster. The mapping of metagenomic reads to gene sequences is weighted by mapping quality, coverage, and gene sequence length to estimate community-wide weighted gene family abundances, as described by Franzosa et al. (3). Gene families are then annotated to MetaCyc (4) reactions (metabolic enzymes) to reconstruct and quantify MetaCyc metabolic pathways in the community, as described by Franzosa et al. (3). The UniRef90 gene families are also regrouped to GO terms (5) to give an overview of GO functions in the community. Lastly, to facilitate comparisons across samples with different sequencing depths, abundance values are normalized using total-sum scaling (TSS) to produce "copies per million" units (analogous to TPM in RNA-Seq).
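
As a small worked example of the final normalization step, the Python sketch below applies total-sum scaling to a table of abundances so that each sample sums to one million. The function name and the gene family labels are placeholders chosen for illustration; the upstream weighting by mapping quality, coverage, and gene length is not shown.

```python
from __future__ import annotations


def copies_per_million(abundances: dict[str, float]) -> dict[str, float]:
    """Total-sum scaling: rescale abundances so the sample sums to 1,000,000."""
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("sample has no abundance to normalize")
    return {feature: value * 1_000_000 / total for feature, value in abundances.items()}


# Made-up abundances for three placeholder gene families in one sample.
sample = {"gene_family_1": 2.0, "gene_family_2": 6.0, "gene_family_3": 2.0}
cpm = copies_per_million(sample)
print(cpm)
```

Because every sample is rescaled to the same total, values for the same gene family can be compared across samples that were sequenced to different depths.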

References:

  1. Bushnell, B. (2021). BBDuk Guide - DOE Joint Genome Institute. Retrieved 1 August 2021, from https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
  2. UniProt: the universal protein knowledgebase. (2016). Nucleic Acids Research, 45(D1), D158-D169. doi: 10.1093/nar/gkw1099
  3. Franzosa, E., McIver, L., Rahnavard, G., Thompson, L., Schirmer, M., & Weingart, G. et al. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11), 962-968. doi: 10.1038/s41592-018-0176-y
  4. Caspi, R., Foerster, H., Fulcher, C., Kaipa, P., Krummenacker, M., & Latendresse, M. et al. (2007). The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research, 36(Database), D623-D631. doi: 10.1093/nar/gkm900
  5. Carbon, S., Ireland, A., Mungall, C., Shu, S., Marshall, B., & Lewis, S. (2008). AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2), 288-289. doi: 10.1093/bioinformatics/btn615


16S ASV-based Profiling Methods

The CosmosID-HUB Microbiome’s 16S workflow implements the DADA2 algorithm (3) as its core engine and uses the Nextflow ampliseq pipeline (1) definitions to run it on our cloud infrastructure. Briefly, primer removal is done with Cutadapt (4), and quality trimming parameters are passed to DADA2 to ensure that the median quality score over the length of the read exceeds a certain Phred score threshold. Within DADA2, forward and reverse reads are each trimmed to a uniform length based on the quality of reads in the sample; higher-quality data will generally result in longer reads. DADA2 uses machine learning with a parametric error model to learn the error rates for the forward and reverse reads, based on the premise that correct sequences should be more common than any particular error variant. DADA2 then applies its core sample inference algorithm to the filtered and trimmed data, using these learned error models. Paired-end reads are then merged if they have at least 12 bases of overlap and are identical across the entire overlap.
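
The merge rule in the last sentence can be illustrated with a minimal Python sketch. It is a simplification and not DADA2's implementation: it searches for the longest exact suffix/prefix overlap between the forward read and the reverse complement of the reverse read, and the function names and the example sequence are made up.

```python
from __future__ import annotations

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward: str, reverse: str, min_overlap: int = 12) -> str | None:
    """Merge a read pair if the overlap is at least min_overlap bases and exact.

    Returns the merged sequence, or None when no qualifying overlap exists.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for overlap in range(longest, min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


# Made-up 34-base amplicon, read from both ends.
amplicon = "ATGGCTAGCTAGGATCCGATCGTACGATCGGCTA"
reverse_read = reverse_complement(amplicon[10:])

merged = merge_pair(amplicon[:22], reverse_read)     # 12-base overlap
too_short = merge_pair(amplicon[:18], reverse_read)  # only 8 bases overlap
print(merged == amplicon, too_short)  # True None
```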

The resulting table of sequences and observed frequencies is filtered to remove chimeric sequences (those that exactly match a combination of more-prevalent “parent” sequences). Taxonomy and species-level identification (where possible) are conducted with DADA2’s naive Bayesian classifier, using the Silva version 138 database.
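
The notion of a chimera as an exact combination of more-prevalent parent sequences can be sketched as follows. This Python example is a simplified illustration, not DADA2's chimera-removal routine: it skips alignment, treats any strictly more abundant sequence as a potential parent, and uses made-up sequences and function names.

```python
from __future__ import annotations

from itertools import permutations


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def is_exact_bimera(seq: str, parents: list[str]) -> bool:
    """True if seq is a left piece of one parent joined to a right piece of another."""
    if seq in parents:
        return False
    for left_parent, right_parent in permutations(parents, 2):
        left = shared_prefix_len(seq, left_parent)
        right = shared_prefix_len(seq[::-1], right_parent[::-1])  # shared suffix
        if left + right >= len(seq):
            return True
    return False


def remove_bimeras(counts: dict[str, int]) -> dict[str, int]:
    """Drop sequences explained by two strictly more abundant parents."""
    kept = {}
    for seq, n in counts.items():
        parents = [other for other, m in counts.items() if m > n]
        if not is_exact_bimera(seq, parents):
            kept[seq] = n
    return kept


# Made-up sequence table: the third entry joins the start of the first
# sequence to the end of the second.
counts = {
    "AAAAACCCCC": 100,
    "GGGGGTTTTT": 80,
    "AAAAATTTTT": 5,
    "ACGTACGTAC": 20,
}
print(remove_bimeras(counts))
```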

Lastly, the predicted functional potential of the community was profiled using PICRUSt2 (5)(6)(7)(8)(9). Briefly, PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States) is a tool that predicts the functional capabilities and abundances of a microbial community based on the observed amplicon (marker gene) content. Functional capabilities are given as Enzyme Commission (EC) numbers or MetaCyc ontologies, and these can be aggregated to predict pathways that are likely present in a given sample.

References:

  1. Straub, D. et al. Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline. Front. Microbiol. 11, 1–18 (2020).
  2. Callahan, B. J., McMurdie, P. J. & Holmes, S. P. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J. 11, 2639–2643 (2017).
  3. Callahan, B. J. et al. DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods 13, 581–583 (2016).
  4. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10 (2011).
  5. Douglas, G. M. et al. PICRUSt2 for prediction of metagenome functions. Nat. Biotechnol. 38, 685–688 (2020).
  6. Barbera, P. et al. EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Syst. Biol. 68, 365–369 (2019).
  7. Czech, L., Barbera, P. & Stamatakis, A. Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data. Bioinformatics 36, 3263–3265 (2020).
  8. Mirarab, S., Nguyen, N. & Warnow, T. SEPP: SATé-Enabled Phylogenetic Placement. in Biocomputing 2012 247–258 (World Scientific, 2011). doi:10.1142/9789814366496_0024.
  9. Louca, S. & Doebeli, M. Efficient comparative phylogenetics on large trees. Bioinformatics 34, 1053–1055 (2018).
  10. Ye, Y. & Doak, T. G. A Parsimony Approach to Biological Pathway Reconstruction/Inference for Genomes and Metagenomes. PLoS Comput. Biol. 5, e1000465 (2009).
  11. Chiarello, M., McCauley, M., Villéger, S. & Jackson, C. R. Ranking the biases: The choice of OTUs vs. ASVs in 16S rRNA amplicon data analysis has stronger effects on diversity measures than rarefaction and OTU identity threshold. PLoS One 17, 1–19 (2022).

16S OTU-based Profiling Methods:

For taxonomic profiling based on amplicon data, the CosmosID 16S data analysis pipeline starts by preprocessing the raw reads from either paired-end or single-end Fastq files, using read trimming to remove adapters as well as low-quality reads and bases. If the reads are in a paired-end format, the overlapping forward and reverse pairs are joined together; the unjoined R1 and R2 reads are then added to the end of the file. The file is then converted to Fasta format and used as input for OTU picking. OTUs are identified against the CosmosID curated 16S database using a closed-reference OTU picker at 97% sequence similarity through the QIIME framework. The final results are presented in tabular format with the taxonomic names, OTU IDs, frequency, and relative abundance. Results can be downloaded or compared to other 16S samples for visualizations through the CosmosID Comparative Analysis tool.
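
To make the closed-reference step and the output table concrete, here is a toy Python sketch. It does not reproduce the QIIME OTU picker or the CosmosID database: identity is computed position by position on equal-length sequences instead of by alignment, and the reference entries, OTU IDs, and taxon names are placeholders.

```python
from __future__ import annotations

from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_otus(
    reads: list[str],
    reference: dict[str, tuple[str, str]],
    threshold: float = 0.97,
) -> list[tuple[str, str, int, float]]:
    """Assign reads to their best reference OTU; reads below the threshold are discarded.

    Returns rows of (taxon, OTU ID, frequency, relative abundance in percent).
    """
    counts: Counter = Counter()
    for read in reads:
        best_id, best_score = None, 0.0
        for otu_id, (_taxon, ref_seq) in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = otu_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] += 1
    total = sum(counts.values())
    return [
        (reference[otu_id][0], otu_id, n, 100.0 * n / total)
        for otu_id, n in counts.most_common()
    ]


# Placeholder 40-base reference sequences.
ref_a = "ACGT" * 10
ref_b = "TTGCA" * 8
reference = {"OTU_1": ("Taxon A", ref_a), "OTU_2": ("Taxon B", ref_b)}
reads = [
    ref_a,
    ref_a,
    "C" + ref_a[1:],   # 1 mismatch in 40 bases: 97.5% identity, kept
    ref_b,
    "TT" + ref_a[2:],  # 2 mismatches in 40 bases: 95% identity, discarded
]
table = closed_reference_otus(reads, reference)
for row in table:
    print(row)
# ('Taxon A', 'OTU_1', 3, 75.0)
# ('Taxon B', 'OTU_2', 1, 25.0)
```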