2/15/24 » Pipeline Release: KEPLER Host-Agnostic Microbiome Profiler
To complement our human profiler, CHAMP™, Kepler uses a host-agnostic curated database to process samples from beyond the human host, including environmental, animal, soil, and food samples, among others.
Kepler combines the precision of exact k-mer matching with the versatility of probabilistic alignment to get the most out of metagenomic data. With this approach, Kepler robustly identifies and quantifies bacteria, viruses, fungi, and protists using a meticulously curated biomarker database in which over 30,000 species are arranged in a phylogenetic tree-like structure.
The core of Kepler’s technology is patented with both the US Patent Office (US10108778B2, US20200294628A1) and the European Patent Office (ES2899879T3).
Most notably, Kepler marks a transition from NCBI-based nomenclature to GTDB-based nomenclature. Read more about [the NCBI vs. GTDB transition](GTDB vs. NCBI Taxonomic Nomenclature).
Kepler Advantages
How does Kepler work?
The Kepler multi-kingdom taxonomic profiler is divided into three parts:
1. Leveraging a Curated Database of Microbial Genomes
Genomes are selected for the Kepler database of high-quality microbial genomes based on a high completeness-to-contamination ratio and genome assembly quality, prioritizing intra-species diversity while limiting phylogenetic redundancy. The genome assemblies are then scrubbed of low-complexity sequences, prophages, plasmids, and host-contaminated regions to maximize the taxonomic signal-to-noise ratio. The final database spans multiple microbial kingdoms and more than 30,000 species.
2. Identifying Relevant Biomarkers
Once curated and cleaned, the genomes undergo a pre-computation phase in which they are split into n-mers of variable length. Using a phylogenetic tree-like data structure, the n-mers are then categorized as biomarkers that are either shared across genomes or unique to an individual genome. The tree backbone represents genomic biomarkers shared between different taxa, while the tree leaves are individual microbial genomes with their unique biomarkers.
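The shared-versus-unique split described above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from Kepler itself: a single fixed length `k` instead of variable-length n-mers, a flat dictionary instead of a phylogenetic tree, and toy sequences. The names `kmers`, `classify_biomarkers`, `taxon_A`, and `taxon_B` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_biomarkers(genomes, k):
    """Split k-mers into unique (one genome) and shared (several genomes)."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    # A k-mer seen in exactly one genome would sit on that genome's leaf.
    unique = {km: next(iter(names)) for km, names in owners.items() if len(names) == 1}
    # A k-mer seen in several genomes would sit on a shared (backbone) node.
    shared = {km: frozenset(names) for km, names in owners.items() if len(names) > 1}
    return unique, shared


# Toy genomes (hypothetical sequences, for illustration only).
genomes = {
    "taxon_A": "ACGTACGGA",
    "taxon_B": "ACGTTTGCA",
}
unique, shared = classify_biomarkers(genomes, k=4)
```

In this toy case the only 4-mer present in both sequences is `ACGT`, so it lands in `shared`, while every other 4-mer maps to a single genome in `unique`.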
3. Searching the Biomarker Database
The second phase, run per sample, searches the millions of short sequence reads or contigs in your data against the phylogenetic tree-like database build:
- The first comparator splits the sequencing reads into k-mer sets that are then queried across the branches and leaves of the phylogenetic tree to identify the taxa present in the query k-mer sets. It looks for exact matches between query k-mers and reference biomarkers, and classification sensitivity and accuracy are maintained through composite k-mer/biomarker aggregation statistics and coverage depth estimation.
- The second comparator uses a probabilistic Smith-Waterman algorithm based on edit-distance scoring to compare sequencing reads with the reference set of microbial taxa identified by the first comparator.
Overall, abundance precision and classification accuracy are achieved by running the comparators in sequence, scoring each entire read probabilistically against the reference set, and applying a final deconvolution step to distinguish homologous regions.
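The idea of running two comparators in sequence can be sketched as follows. This is an illustration under stated assumptions, not Kepler's implementation: stage 1 is a plain exact k-mer intersection, stage 2 is the classic (non-probabilistic) Smith-Waterman local alignment with arbitrary match, mismatch, and gap scores, and the aggregation statistics, coverage depth estimation, and deconvolution steps are omitted. All function names, parameters, and sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shortlist_taxa(read, references, k, min_hits=1):
    """Stage 1: keep taxa sharing at least min_hits exact k-mers with the read."""
    read_kmers = kmers(read, k)
    hits = {name: len(read_kmers & kmers(ref, k)) for name, ref in references.items()}
    return {name: n for name, n in hits.items() if n >= min_hits}


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Stage 2: best local alignment score of a vs. b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def classify_read(read, references, k=4):
    """Align the read only against taxa that survived the k-mer shortlist."""
    candidates = shortlist_taxa(read, references, k)
    scores = {name: smith_waterman_score(read, references[name]) for name in candidates}
    return max(scores, key=scores.get) if scores else None


# Toy references and read (hypothetical sequences, for illustration only).
references = {
    "taxon_A": "ACGTACGGATTC",
    "taxon_B": "ACGTTTGCAGGA",
    "taxon_C": "TTTTTTTTTTTT",
}
read = "GTACGGAT"
best_taxon = classify_read(read, references)
```

The design point is that the cheap exact-match stage narrows the reference set, so the more expensive alignment only runs against plausible candidates. Here only `taxon_A` shares any 4-mers with the read, so it is the only reference that gets aligned.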
Evaluation of Kepler with Biological Community Standards
To benchmark Kepler, real-world community standards were used to compare its performance against leading profilers such as Kraken2/Bracken and MetaPhlAn4. These comparisons used five community standards from ATCC and Zymo, with both even and staggered (log-distributed) compositions.
Kepler distinguished itself by achieving a superior F1-score (the harmonic mean of precision and sensitivity), by its ability to detect low-abundance taxa (bacteria and fungi), and by its precision in differentiating closely related taxa at the subspecies level, for example Bifidobacterium longum subsp. longum and Bifidobacterium longum subsp. infantis.
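For readers unfamiliar with the metric, the sketch below shows how an F1-score could be computed for a profiler's output against the known composition of a community standard. It assumes simple presence/absence scoring over sets of taxon names. The taxon labels are placeholders and the numbers are not benchmark results.

```python
def precision_recall_f1(expected, detected):
    """Score detected taxa against the known composition of a standard."""
    tp = len(expected & detected)   # taxa correctly reported
    fp = len(detected - expected)   # taxa reported but not in the standard
    fn = len(expected - detected)   # taxa in the standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also called sensitivity
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder taxa (hypothetical, not from any real benchmark).
expected = {"species_1", "species_2", "species_3", "species_4"}
detected = {"species_1", "species_2", "species_3", "species_9"}
precision, recall, f1 = precision_recall_f1(expected, detected)
```

With three true positives, one false positive, and one missed taxon, precision, recall, and F1 all come out to 0.75 in this example.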