Guides

2/15/24 » Pipeline Release: KEPLER Host-Agnostic Microbiome Profiler

To complement CHAMP™, our human profiler, Kepler uses a curated, host-agnostic database to process samples from beyond the human host, including but not limited to environmental, animal, soil, and food samples.

Kepler combines the precision of exact k-mer matching with the versatility of probabilistic alignment to get the most out of metagenomic data. This approach enables robust identification and enumeration of bacteria, viruses, fungi, and protists against a meticulously curated biomarker database in which over 30,000 species are arranged in a phylogenetic tree-like structure.

The core of Kepler’s technology is patented with both the US Patent Office (US10108778B2, US20200294628A1) and the European Patent Office (ES2899879T3).

Most notably, Kepler marks a transition from NCBI-based nomenclature to GTDB-based nomenclature. Read more about [the NCBI vs. GTDB transition](GTDB vs. NCBI Taxonomic Nomenclature).

📘

Download the Benchmark Whitepaper

Kepler Advantages


How does Kepler work?

The Kepler multi-kingdom taxonomic profiler is divided into three parts:

1. Leveraging a Curated Database of Microbial Genomes

The Kepler database of high-quality microbial genomes is built by selecting genomes with high completeness, low contamination, and good assembly quality, prioritizing intra-species diversity while limiting phylogenetic redundancy. The genome assemblies are then scrubbed of low-complexity sequences, prophages, plasmids, and host-contaminated regions to maximize the taxonomic signal-to-noise ratio. The final database spans multiple microbial kingdoms and more than 30,000 species.

2. Identifying Relevant Biomarkers

With the genomes curated and cleaned, they undergo a pre-computation phase where they’re split into n-mers of variable length. The n-mers are then categorized as either shared or unique biomarkers across individual genomes, which is facilitated by a phylogenetic tree-like data structure. The tree backbone represents shared genomic biomarkers between different taxa, while the tree leaves are individual microbial genomes with unique biomarkers.
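The idea of placing shared biomarkers on the tree backbone and unique biomarkers on the leaves can be sketched in a few lines. This is a minimal illustration, not Kepler's implementation: it assumes a fixed k (the text describes variable-length n-mers), assigns each k-mer to the lowest common ancestor of the genomes that contain it, and uses hypothetical names (`build_biomarker_index`, `genome_A`, `genus_X`) and toy sequences.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def lineage(node, parent):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Deepest tree node shared by the lineages of all given nodes."""
    nodes = list(nodes)
    common = set(lineage(nodes[0], parent))
    for other in nodes[1:]:
        common &= set(lineage(other, parent))
    # Walking up from the first node, the first shared node is the deepest one.
    for node in lineage(nodes[0], parent):
        if node in common:
            return node
    raise ValueError("nodes do not share a root")


def build_biomarker_index(genomes, parent, k):
    """Map each k-mer to a leaf (unique) or an internal node (shared)."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    return {kmer: lowest_common_ancestor(names, parent)
            for kmer, names in owners.items()}


# Toy tree (child -> parent) and toy genomes; all names are hypothetical.
parent = {"genome_A": "genus_X", "genome_B": "genus_X",
          "genus_X": "root", "genome_C": "root"}
genomes = {"genome_A": "ACGTAC", "genome_B": "ACGTTT", "genome_C": "GGGTTT"}
index = build_biomarker_index(genomes, parent, k=4)
```

In this toy index, a k-mer found only in `genome_A` stays on that leaf, one shared by `genome_A` and `genome_B` moves up to `genus_X`, and one shared across the whole tree ends up at `root`, where it carries little taxonomic signal.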

3. Searching the Biomarker Database

The second per-sample, computational phase searches the millions of short sequence reads or contigs in your data against the phylogenetic tree-like database build:

  • The first comparator splits the sequencing reads into k-mer sets, which are then queried across the branches and leaves of the phylogenetic tree to identify the taxa present. It looks for exact matches between query k-mers and reference biomarkers; classification sensitivity and accuracy are maintained through composite k-mer/biomarker aggregation statistics and coverage depth estimation.
  • The second comparator uses a probabilistic Smith-Waterman algorithm with edit-distance-based scoring to compare the sequencing reads against a reference set built from the microbial taxa identified by the first comparator.

Overall abundance precision and classification accuracy are achieved by running the comparators in sequence, scoring each entire read probabilistically against the reference set, and applying a final deconvolution step to distinguish homologous regions.
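The two-stage search can be illustrated with a rough sketch: exact k-mer lookups shortlist candidate taxa, and a local alignment then scores the whole read against only those candidates. The assumptions here are substantial: this uses a plain Smith-Waterman score with linear gap penalties rather than Kepler's probabilistic variant, it omits the aggregation statistics, coverage estimation, and deconvolution steps, and every name, parameter value, and sequence is hypothetical.

```python
from collections import Counter


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def candidate_taxa(read, biomarkers, k):
    """Stage 1: count exact k-mer hits per taxon."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = biomarkers.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    return hits


def classify(read, biomarkers, references, k, min_hits=1):
    """Stage 2: align the read only against taxa shortlisted in stage 1."""
    hits = candidate_taxa(read, biomarkers, k)
    shortlist = [t for t, n in hits.items() if n >= min_hits and t in references]
    if not shortlist:
        return None
    scores = {t: smith_waterman_score(read, references[t]) for t in shortlist}
    return max(scores, key=scores.get)


# Toy biomarker table and reference sequences (hypothetical).
biomarkers = {"ACGT": "taxon_A", "TTGA": "taxon_B"}
references = {"taxon_A": "GGACGTACGG", "taxon_B": "CCTTGACC"}
assigned = classify("ACGTAC", biomarkers, references, k=4)
unassigned = classify("GGGGGG", biomarkers, references, k=4)
```

Restricting the alignment to the shortlist is what keeps the second stage affordable: the quadratic-time alignment runs against a handful of candidate references instead of the whole database.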

Evaluation of Kepler with Biological Community Standards

To benchmark Kepler, real-world community standards were used to compare its performance against leading profilers such as Kraken2/Bracken and MetaPhlAn4. These comparisons used 5 different community standards from ATCC and Zymo, with both even and staggered (log-distributed) compositions.

Kepler distinguished itself by achieving a superior F1-score (a balanced measure of precision and sensitivity), by its ability to detect low-abundance taxa (bacteria and fungi), and by its precision in differentiating closely related taxa at the subspecies level, for example Bifidobacterium longum subsp. longum and Bifidobacterium longum subsp. infantis.
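For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and sensitivity (recall). The sketch below computes it from presence/absence calls against a known community standard; the function name and the taxon lists are illustrative only and are not taken from the benchmark.

```python
def detection_scores(expected, detected):
    """Precision, recall (sensitivity), and F1 for presence/absence calls."""
    expected, detected = set(expected), set(detected)
    tp = len(expected & detected)   # taxa in the standard that were reported
    fp = len(detected - expected)   # reported taxa that are not in the standard
    fn = len(expected - detected)   # taxa in the standard that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with made-up calls: three correct, one missed, one spurious.
expected = ["taxon_1", "taxon_2", "taxon_3", "taxon_4"]
detected = ["taxon_1", "taxon_2", "taxon_3", "taxon_5"]
precision, recall, f1 = detection_scores(expected, detected)
```

Because the harmonic mean is pulled toward the lower of the two values, a profiler cannot score well on F1 by reporting everything (high sensitivity, low precision) or by reporting only the most abundant taxa (high precision, low sensitivity).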

**Figure 1:** Sensitivity of Kepler compared with other methods on staggered community standards and a fungal biological dataset. Kepler significantly outperforms the other methods in detecting both bacteria and fungi within these staggered standards.
