CHAMP™ Human Microbiome Profiling

This section primarily provides technical methods for CHAMP™ Human Microbiome Profiling

A) Gene Catalog and Species Definition

The Clinical Microbiomics Human Microbiome Reference HMR05 gene catalog is derived primarily from high-quality (HQ) prokaryotic MAGs (metagenome-assembled genomes) identified from 30,382 human microbiome samples collected from nine distinct human body sites: gut (n = 19,296), oral (n = 4994), skin (n = 4306), urine (n = 934), small intestine (n = 780), nasopharyngeal (n = 422), vaginal (n = 367), airway (n = 108), and milk (n = 100). This includes MAGs available from UHGG (89,262 MAGs, PMID: 32690973) and from ELGG (25,303 MAGs, PMID: 36050292). 11% (n = 3317) of the samples were not publicly available. In addition, genome assemblies from NCBI and PATRIC were added to capture otherwise missing species of interest (human-associated pathogens, probiotics, food ingredients, and species relevant for benchmarking). Human-relevant eukaryotic species were manually identified from various sources, including an analysis of gut fungal species (PMID: 29178920), publicly available lists of pathogens, the eukaryotes profiled by MetaPhlAn 4 (PMID: 36823356), and various species relevant for benchmarking, resulting in 2740 genomes representing 244 species.

For MAGs not obtained from publicly available MAG collections, reads were host-filtered, trimmed, and assembled into contigs with Megahit (v. 1.2.9, PMID: 25609793) or metaSPAdes (v. 3.15.5, PMID: 28298430), and then binned using VAMB (v. 3.0.6, PMID: 33398153). MAGs were considered high-quality if they had > 90% completeness and < 5% contamination based on CheckM2 (v. 2022-07-19) and passed the GUNC chimerism test (v. 1.0.5, PMID: 34120611). All MAGs were taxonomically annotated using GTDB-Tk (v. 2.3.0, PMID: 36218463) with GTDB database version r214 (PMID: 34520557). To combine MAGs from multiple VAMB batches and MAG collections, MAGs annotated to the same species were merged into species clusters. MAGs without GTDB-Tk species-level annotations were merged with each other or with existing species clusters at 95% ANI (dRep, PMID: 28742071; FastANI, PMID: 30504855). This resulted in 6567 prokaryotic species clusters, 10% of which were unannotated at species level.
To derive a pan-genome catalog for each species, we used a three-step clustering approach. First, genes were clustered with MMseqs2 (v. 14, PMID: 29959318) with 98% identity and 90% bi-directional coverage. Second, the representatives from the first iteration were clustered with MMseqs2 with 95% identity and 90% bi-directional coverage. Representatives of the second iteration were chosen as the ones with highest cardinality from the first iteration. Third, the second iteration representatives were clustered with cd-hit (cd-hit-est, v. 4.8.1, PMID: 23060610) with 95% identity and 90% coverage of the shorter sequence. Genes shorter than 100 bp or with species prevalence < 1% were discarded.

For prokaryotes and eukaryotes separately, the entire set of pangenomes were then clustered with MMseqs2 with 97% identity and 90% bi-directional coverage to obtain between-species clusters. The pan-genomes from prokaryotic (n = 6567) and eukaryotic (n = 244) species were merged into a final gene catalog of 25,761,278 genes.
To enable quantification of each species in the database, up to 250 signature genes were selected for each species based on core genes (≥ 60% prevalence in species MAGs) with a length ≥ 200 bp and ≤ 20 kbp. Furthermore, signature genes were required to be species-unique, with no alignments of 100 bp with > 97% sequence-identity to other genes in the catalog; however, if fewer than 20 genes meeting this criteria were available for a species, then genes with segments > 200 bp without alignments to other genes were used, and non-unique segments of these genes were masked.

B) Sequencing data preprocessing

Raw FASTQ files were filtered to remove human genomic contamination by discarding read pairs in which either read mapped to the human reference genome GRCh38.p14 with Bowtie2 (v. 2.4.2, Langmead and Salzberg 2012, PMID: 22388286) in local alignment mode. Reads were then trimmed to remove adapters and bases with a Phred score below 30 using AdapterRemoval (v. 2.3.1, Schubert et al. 2016, PMID: 26868221). Trimmed, host-filtered read pairs with both lengths ≥ 100 bp, defined as high-quality non-host (HQNH) reads, were retained

C) Mapping reads to the gene catalog

HQNH reads were mapped to the gene catalog using BWA mem (v. 0.7.17) (Li and Durbin 2009, PMID: 19451168). An individual read was considered uniquely mapped to a gene if the MAPQ was ≥ 20 and the read aligned with ≥ 95 % identity over ≥ 100 bp. However, if > 10 bases of the read did not align to the gene or extend beyond the gene, the read was considered unmapped. Reads meeting the alignment length and identity criteria but not the MAPQ threshold were considered multi-mapped.

Each read pair was counted as either 1) uniquely mapped to a specific gene, if one or both individual reads were uniquely mapped to a gene, or 2) multi-mapped, if neither read was uniquely mapped, and at least one was multi-mapped, or 3) unmapped, if both individual reads were unmapped. If the two reads were each uniquely mapped to a different gene, the gene mapped by read 1 was counted but not the gene mapped by read 2. A gene count table was created with the number of uniquely mapped read pairs for each gene.

D) Species relative abundance calculation

The expected read counts for signature genes in each species in each sample were modeled with a negative binomial distribution as follows. First, if >=50 of the signature genes for a species had non-zero read counts and >=99% of genes were expected to have non-zero read counts given the total read count for that species (1-(((n_genes-1)/n_genes)^n_reads)>=0.99), then signature genes with zero reads were ignored in that sample. Second, the expected 99% quantile (between 0.5% and 99.5%) of read counts were calculated for each gene based on a negative binomial distribution with a mean proportional to the effective gene length (accounting for read length and mapping alignment criteria) and dispersion defined as log2(effective gene length). The abundance of each species was then calculated as the mean read count normalised by effective gene length based on reads mapping to signature genes with observed read counts within the expected 99% quantile. Species abundances were set to zero if less than 5 genes with non-zero read counts were within the 99% quantile. Furthermore, species with <66% of genes with non-zero read count within the 99% quantile were set to zero, unless the median abundance of signature genes was non-zero, in which case the median gene-length-corrected abundance of non-zero genes was used. Abundances were then normalized sample-wise such that the total abundance of all species sums to 100 %.

E) Functional annotation and profiling

EggNOG-mapper (v. 2.1.7, Diamond mode, Cantalapiedra et al. 2021, PMID: 34597405) was used to map prokaryotic genes in the gene catalog to the EggNOG orthologous groups database (v. 5.0, Huerta-Cepas et al. 2019, PMID: 30418610) and Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) database. Eukaryotic genes were annotated using KofamScan (PMID: 31742321). Functional potential profiles based on KOs were calculated as the proportion of the total gene abundance that mapped to a given KO.

Kyoto Encyclopedia of Genes and Genomes (KEGG) modules (v. 78.2) (Kanehisa and Goto 2000, PMID: 10592173) are defined as a set of KOs that enable a specific function or pathway. Functional potential profiles based on KEGG modules were generated from the species profiles. For this, we identified the set of species associated to each of the KEGG modules by following three criteria: 1) a species was associated to a KEGG module if it included at least 2/3 of the genes encoding the proteins/enzymes needed to complete the functionality of the module, 2) if a module had alternative reaction paths, only one of these was required to be 2/3 complete, 3) for modules with three or fewer steps, all steps were required to be comprised in the given species. KEGG module profiles based on counts, relative abundances, rarefied counts and rarefied relative abundances were then calculated by adding, respectively, the number of counts, relative abundances, rarefied counts and rarefied relative abundances from each species associated to a given KEGG module.

F) Gut-Brain Modules

The gut-brain modules (GBMs) are a set of 56 microbial pathways for metabolizing neuroactive compounds (molecules that have the potential to interact with the human nervous system). Each GBM corresponds to a single neuroactive compound synthesis or degradation process by members of the gut microbiota and is defined as a series of enzymatic steps represented by ortholog group identifiers (KEGG, TIGRFAM, and eggNOG version 3.0 orthology databases in order of preference) (Valles-Colomer et al. 2016, PMID: 30718848). We consider a species to contain a given module if the species includes genes annotated to at least 2/3 of the steps needed to complete the functionality of the module. If a module has alternative reaction paths, only one of these is required to be 2/3 complete. For modules with three or fewer steps, all steps are required to be comprised in the species.

G) Gut Metabolic Modules

Gut Metabolic Modules (GMMs) are a set of 103 conserved human gut metabolic pathways, each defined as a series of enzymatic steps represented by Kyoto Encyclopedia of Genes and Genomes orthology (KO) identifiers (Vieira-Silva et al. 2016, PMID: 27573110). Functional potential profiles based on GMMs were generated from the species profiles. For this, we identified the set of species associated to each of the GMMs by following three criteria: 1) a species was associated to a GMM if it included at least 2/3 of the genes encoding the proteins/enzymes needed to complete the functionality of the module, 2) if a module had alternative reaction paths, only one of these was required to be 2/3 complete, 3) for modules with three or fewer steps, all steps were required to be comprised in the given species. GMM profiles based on counts, relative abundances, rarefied counts and rarefied relative abundances were then calculated by adding, respectively, the number of counts, relative abundances, rarefied counts and rarefied relative abundances from each species associated to a given GMM.