Functional Classification

Functional 2.0 Workflow

Functional profiling from whole genome shotgun microbiome or metatranscriptomic short reads sequencing data provides crucial insights into the genomic potential of underlying molecular, biochemical and metabolic activities of microbial communities. Understanding the functional potential of a microbial community also allows testing of hypotheses to link or associate specific molecular or biochemical activities to environmental and health associated phenotypes. In order to aid scientists explore and investigate these hypotheses, we are pleased to introduce the Functional 2.0 Workflow in CosmosID-HUB Microbiome that leverages Enzyme Commission, MetaCyc Pathways, Pfam CAZy and GO Terms databases to characterize the functional potential of the microbiome community.

The single sample view of functional workflow entails the tabular view of all databases along with stacked bar chart and donut chart to aid in visual inspection of functional capabilities of the microbiome population.

Clicking on the first column for each respective functional databases will take you to that specific feature's description on that respective database's website.

Technical Appendix

Functional 2.0 Workflow:

Initial QC, adapter trimming and preprocessing of metagenomic sequencing reads are done using BBduk (1). The quality controlled reads are then subjected to a translated search against a comprehensive and non-redundant protein sequence database, UniRef 90. The UniRef90 database, provided by UniProt (2), represents a clustering of all non-redundant protein sequences in UniProt, such that each sequence in a cluster aligns with 90% identity and 80% coverage of the longest sequence in the cluster. The mapping of metagenomic reads to gene sequences are weighted by mapping quality, coverage and gene sequence length to estimate community wide weighted gene family abundances as described by Franzosa et al (3). Gene families are then annotated to MetaCyc (4) reactions (Metabolic Enzymes) to reconstruct and quantify MetaCyc (4) metabolic pathways in the community as described by Franzosa et al (3). Furthermore, the UniRef_90 gene families are also regrouped to Enzyme Commission Enzymes, Pfam protein domains, CAZy enzymes and GO Terms in order to get an exhaustive overview of gene functions in the community. Lastly, to facilitate comparisons across multiple samples with different sequencing depths, the abundance values are normalized using Total-sum scaling (TSS) normalization to produce "Copies per million" (analogous to TPMs in RNA-Seq) units.

Functional 1.0 Workflow

Initial QC, adapter trimming and preprocessing of metagenomic sequencing reads are done using BBduk (1). The quality controlled reads are then subjected to a translated search against a comprehensive and non-redundant protein sequence database, UniRef 90. The UniRef90 database, provided by UniProt (2), represents a clustering of all non-redundant protein sequences in UniProt, such that each sequence in a cluster aligns with 90% identity and 80% coverage of the longest sequence in the cluster. The mapping of metagenomic reads to gene sequences are weighted by mapping quality, coverage and gene sequence length to estimate community wide weighted gene family abundances as described by Franzosa et al (3). Gene families are then annotated to MetaCyc (4) reactions (Metabolic Enzymes) to reconstruct and quantify MetaCyc (4) metabolic pathways in the community as described by Franzosa et al (3). Furthermore, the UniRef_90 gene families are also regrouped to GO terms (5) in order to get an overview of GO functions in the community. Lastly, to facilitate comparisons across multiple samples with different sequencing depths, the abundance values are normalized using Total-sum scaling (TSS) normalization to produce "Copies per million" (analogous to TPMs in RNA-Seq) units.

References:
Bushnell, B. (2021). BBDuk Guide - DOE Joint Genome Institute. Retrieved 1 August 2021, from https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/

UniProt: the universal protein knowledgebase. (2016). Nucleic Acids Research, 45(D1), D158-D169. doi: 10.1093/nar/gkw1099

Franzosa, E., McIver, L., Rahnavard, G., Thompson, L., Schirmer, M., & Weingart, G. et al. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11), 962-968. doi: 10.1038/s41592-018-0176-y

Caspi, R., Foerster, H., Fulcher, C., Kaipa, P., Krummenacker, M., & Latendresse, M. et al. (2007). The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research, 36(Database), D623-D631. doi: 10.1093/nar/gkm900

Carbon, S., Ireland, A., Mungall, C., Shu, S., Marshall, B., & Lewis, S. (2008). AmiGO: online access to ontology and annotation data. Bioinformatics, 25(2), 288-289. doi: 10.1093/bioinformatics/btn615