CHAMP™ Human Microbiome Profiling

The CHAMP™ Human Microbiome Profiler is a powerful taxonomic and functional pipeline for in-depth analysis of the human microbiome, utilizing cutting-edge computational methods to offer high-resolution insights into microbial communities. CHAMP™ applies advanced taxonomic profiling techniques, enabling accurate identification of microbial species across diverse sample types. It supports researchers in deciphering the complexity of the human microbiome by leveraging curated reference databases and statistical models to ensure robust, reproducible results. This profiler also integrates multi-omic data for comprehensive analysis, helping to uncover associations between microbial diversity and health outcomes.

How was CHAMP™ developed?

CHAMP™ employs over 400,000 metagenome-assembled genomes (MAGs), created from a collection of more than 30,000 microbiome samples from individuals across the world. These samples were sourced from 9 different human body sites including but not limited to: gut (stool and small intestine), vagina, skin and mouth. From this extensive human microbiome data collection, we have identified almost 7000 bacteria, archaea, and eukaryote species, including many newly discovered species. This is the foundation of our microbiome analysis services.

CHAMP™ provides the most accurate and sensitive microbiome profiling data thanks to the following features:

  • High sensitivity - Detects low abundant and rare species in samples.
  • Superior specificity - Exclusively detects species actually present in samples.
  • Abundance accuracy - Provides correct abundance estimations.
  • Accurate taxonomic annotation - Microorganisms correctly annotated at species resolution.

CHAMP™ Benchmarking

CHAMP™ demonstrates best-in-class performance using both CAMI and NIBSC benchmarks against the best and most widely used profiling pipelines in the field.

  • Compared to MetaPhlAn4, CHAMP™ showed 16% greater sensitivity (recall) across different human body sites and showed an astounding 400 times lower false signal in the NIBSC mock community benchmark compared to state-of-the-art profilers (MetaPhlAn4, Centrifuge, Kraken, and Bracken). This means that when CHAMP™ detects something, you can trust it’s there.
  • CHAMP™ uses the latest GTDB annotation which includes more rare species and, compared to NCBI, often classifies sub-species as species in their own right.

Best-in-class microbiome profiling data

Combining the world’s most comprehensive reference catalogue for human-associated microbes with a proprietary algorithm for detection and abundance estimation of species, CHAMP™ provides you with the best microbiome data for further analysis.

For analyzing low biomass samples, searching for very rare species, profiling the microbiome of large-scale population health studies, or designing next generation probiotics or live biotherapeutics, CHAMP™ sets new standards in sensitivity and accuracy for microbiome research.

CHAMP™ also provides phage and virome profiling and advanced functional annotations. Furthermore, with the addition of our clonal-level analysis tool, it can separate very closely related strains, helping researchers understand the dynamics between probiotics/therapeutic microbes and endogenous populations of the same species.

CHAMP™ is available as part of our end-to-end human microbiome research services or as a standalone service for profiling or re-profiling of shotgun sequencing data.

By combining high-precision microbiome profiling with customizable analysis workflows, CHAMP enables researchers to efficiently explore the vast landscape of microbial interactions within the human body, ultimately advancing our understanding of microbiome-related diseases and therapies.


CHAMP™ Manuscript

CHAMP™ has been described in our BioRxiv preprint, which is available below.


Detailed Methods Summary

Gene Catalog and Species Definition

The Clinical Microbiomics Human Microbiome Reference HMR05 gene catalog is derived primarily from high-quality (HQ) prokaryotic MAGs (metagenome-assembled genomes) identified from 30,382 human microbiome samples collected from nine distinct human body sites: gut (n = 19,296), oral (n = 4994), skin (n = 4306), urine (n = 934), small intestine (n = 780), nasopharyngeal (n = 422), vaginal (n = 367), airway (n = 108), and milk (n = 100). This includes MAGs available from UHGG (89,262 MAGs, PMID: 32690973) and from ELGG (25,303 MAGs, PMID: 36050292). 11% (n = 3317) of the samples were not publicly available. In addition, genome assemblies from NCBI and PATRIC were added to capture otherwise missing species of interest (human-associated pathogens, probiotics, food ingredients, and species relevant for benchmarking). Human-relevant eukaryotic species were manually identified from various sources, including an analysis of gut fungal species (PMID: 29178920), publicly available lists of pathogens, the eukaryotes profiled by MetaPhlAn 4 (PMID: 36823356), and various species relevant for benchmarking, resulting in 2740 genomes representing 244 species.

For MAGs not obtained from publicly available MAG collections, reads were host-filtered, trimmed, and assembled into contigs with Megahit (v. 1.2.9, PMID: 25609793) or metaSPAdes (v. 3.15.5, PMID: 28298430), and then binned using VAMB (v. 3.0.6, PMID: 33398153). MAGs were considered high-quality if they had > 90% completeness and < 5% contamination based on CheckM2 (v. 2022-07-19) and passed the GUNC chimerism test (v. 1.0.5, PMID: 34120611). All MAGs were taxonomically annotated using GTDB-Tk (v. 2.3.0, PMID: 36218463) with GTDB database version r214 (PMID: 34520557). To combine MAGs from multiple VAMB batches and MAG collections, MAGs annotated to the same species were merged into species clusters. MAGs without GTDB-Tk species-level annotations were merged with each other or with existing species clusters at 95% ANI (dRep, PMID: 28742071; FastANI, PMID: 30504855). This resulted in 6567 prokaryotic species clusters, 10% of which were unannotated at species level.
To derive a pan-genome catalog for each species, we used a three-step clustering approach. First, genes were clustered with MMseqs2 (v. 14, PMID: 29959318) with 98% identity and 90% bi-directional coverage. Second, the representatives from the first iteration were clustered with MMseqs2 with 95% identity and 90% bi-directional coverage. Representatives of the second iteration were chosen as the ones with highest cardinality from the first iteration. Third, the second iteration representatives were clustered with cd-hit (cd-hit-est, v. 4.8.1, PMID: 23060610) with 95% identity and 90% coverage of the shorter sequence. Genes shorter than 100 bp or with species prevalence < 1% were discarded.

For prokaryotes and eukaryotes separately, the entire set of pangenomes were then clustered with MMseqs2 with 97% identity and 90% bi-directional coverage to obtain between-species clusters. The pan-genomes from prokaryotic (n = 6567) and eukaryotic (n = 244) species were merged into a final gene catalog of 25,761,278 genes.
To enable quantification of each species in the database, up to 250 signature genes were selected for each species based on core genes (≥ 60% prevalence in species MAGs) with a length ≥ 200 bp and ≤ 20 kbp. Furthermore, signature genes were required to be species-unique, with no alignments of 100 bp with > 97% sequence-identity to other genes in the catalog; however, if fewer than 20 genes meeting this criteria were available for a species, then genes with segments > 200 bp without alignments to other genes were used, and non-unique segments of these genes were masked.

Sequencing data preprocessing

Raw FASTQ files were filtered to remove human genomic contamination by discarding read pairs in which either read mapped to the human reference genome GRCh38.p14 with Bowtie2 (v. 2.4.2, Langmead and Salzberg 2012, PMID: 22388286) in local alignment mode. Reads were then trimmed to remove adapters and bases with a Phred score below 30 using AdapterRemoval (v. 2.3.1, Schubert et al. 2016, PMID: 26868221). Trimmed, host-filtered read pairs with both lengths ≥ 100 bp, defined as high-quality non-host (HQNH) reads, were retained

Mapping reads to the gene catalog

HQNH reads were mapped to the gene catalog using BWA mem (v. 0.7.17) (Li and Durbin 2009, PMID: 19451168). An individual read was considered uniquely mapped to a gene if the MAPQ was ≥ 20 and the read aligned with ≥ 95 % identity over ≥ 100 bp. However, if > 10 bases of the read did not align to the gene or extend beyond the gene, the read was considered unmapped. Reads meeting the alignment length and identity criteria but not the MAPQ threshold were considered multi-mapped.

Each read pair was counted as either 1) uniquely mapped to a specific gene, if one or both individual reads were uniquely mapped to a gene, or 2) multi-mapped, if neither read was uniquely mapped, and at least one was multi-mapped, or 3) unmapped, if both individual reads were unmapped. If the two reads were each uniquely mapped to a different gene, the gene mapped by read 1 was counted but not the gene mapped by read 2. A gene count table was created with the number of uniquely mapped read pairs for each gene.

Species relative abundance calculation

The expected read counts for signature genes in each species in each sample were modeled with a negative binomial distribution as follows. First, if >=50 of the signature genes for a species had non-zero read counts and >=99% of genes were expected to have non-zero read counts given the total read count for that species (1-(((n_genes-1)/n_genes)^n_reads)>=0.99), then signature genes with zero reads were ignored in that sample. Second, the expected 99% quantile (between 0.5% and 99.5%) of read counts were calculated for each gene based on a negative binomial distribution with a mean proportional to the effective gene length (accounting for read length and mapping alignment criteria) and dispersion defined as log2(effective gene length). The abundance of each species was then calculated as the mean read count normalised by effective gene length based on reads mapping to signature genes with observed read counts within the expected 99% quantile. Species abundances were set to zero if less than 5 genes with non-zero read counts were within the 99% quantile. Furthermore, species with <66% of genes with non-zero read count within the 99% quantile were set to zero, unless the median abundance of signature genes was non-zero, in which case the median gene-length-corrected abundance of non-zero genes was used. Abundances were then normalized sample-wise such that the total abundance of all species sums to 100 %.

Functional annotation and profiling

EggNOG-mapper (v. 2.1.7, Diamond mode, Cantalapiedra et al. 2021, PMID: 34597405) was used to map prokaryotic genes in the gene catalog to the EggNOG orthologous groups database (v. 5.0, Huerta-Cepas et al. 2019, PMID: 30418610) and Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) database. Eukaryotic genes were annotated using KofamScan (PMID: 31742321). Functional potential profiles based on KOs were calculated as the proportion of the total gene abundance that mapped to a given KO.

Kyoto Encyclopedia of Genes and Genomes (KEGG) modules (v. 78.2) (Kanehisa and Goto 2000, PMID: 10592173) are defined as a set of KOs that enable a specific function or pathway. Functional potential profiles based on KEGG modules were generated from the species profiles. For this, we identified the set of species associated to each of the KEGG modules by following three criteria: 1) a species was associated to a KEGG module if it included at least 2/3 of the genes encoding the proteins/enzymes needed to complete the functionality of the module, 2) if a module had alternative reaction paths, only one of these was required to be 2/3 complete, 3) for modules with three or fewer steps, all steps were required to be comprised in the given species. KEGG module profiles based on counts, relative abundances, rarefied counts and rarefied relative abundances were then calculated by adding, respectively, the number of counts, relative abundances, rarefied counts and rarefied relative abundances from each species associated to a given KEGG module.

Gut-Brain Modules

The gut-brain modules (GBMs) are a set of 56 microbial pathways for metabolizing neuroactive compounds (molecules that have the potential to interact with the human nervous system). Each GBM corresponds to a single neuroactive compound synthesis or degradation process by members of the gut microbiota and is defined as a series of enzymatic steps represented by ortholog group identifiers (KEGG, TIGRFAM, and eggNOG version 3.0 orthology databases in order of preference) (Valles-Colomer et al. 2016, PMID: 30718848). We consider a species to contain a given module if the species includes genes annotated to at least 2/3 of the steps needed to complete the functionality of the module. If a module has alternative reaction paths, only one of these is required to be 2/3 complete. For modules with three or fewer steps, all steps are required to be comprised in the species.

Gut Metabolic Modules

Gut Metabolic Modules (GMMs) are a set of 103 conserved human gut metabolic pathways, each defined as a series of enzymatic steps represented by Kyoto Encyclopedia of Genes and Genomes orthology (KO) identifiers (Vieira-Silva et al. 2016, PMID: 27573110). Functional potential profiles based on GMMs were generated from the species profiles. For this, we identified the set of species associated to each of the GMMs by following three criteria: 1) a species was associated to a GMM if it included at least 2/3 of the genes encoding the proteins/enzymes needed to complete the functionality of the module, 2) if a module had alternative reaction paths, only one of these was required to be 2/3 complete, 3) for modules with three or fewer steps, all steps were required to be comprised in the given species. GMM profiles based on counts, relative abundances, rarefied counts and rarefied relative abundances were then calculated by adding, respectively, the number of counts, relative abundances, rarefied counts and rarefied relative abundances from each species associated to a given GMM.