The Long-Read Amplicon Pipeline on Cosmos-Hub is an Emu-based workflow that delivers species-level microbial community profiling from full-length 16S/18S/ITS rRNA gene sequences generated by long-read sequencing platforms such as PacBio and Oxford Nanopore Technologies (ONT). Unlike traditional short-read 16S approaches, which target only a subset of hypervariable regions (e.g., V3-V4), long-read amplicon sequencing captures the entire ~1,500 bp 16S gene, including all nine hypervariable regions (V1-V9). This full coverage enables species- and subspecies-level resolution that short-read methods cannot achieve.
How does Cosmos-Hub’s Emu implementation differ from the standard Emu pipeline?

The Cosmos-Hub implementation includes several enhancements over the standard Emu parameters:
  • Optimal performance across diverse sample types through fine-tuned minimap2 parameters
  • Enhanced EM configuration via probability and iteration parameters for improved convergence and fewer false positives
  • Multi-database support with parallel processing across specialized reference databases
  • End-to-end automation from raw reads to publication-ready taxonomic profiles
  • Advanced quality control with comprehensive filtering statistics and visualizations
  • Parameter-rich customization tailored to your specific study requirements

Required Parameters

There are a few preprocessing parameters that are required to run the long-read amplicon pipeline:
  1. Database Selection: Select the database(s) most appropriate for your sample type (up to two; the databases are described in the Database Selection section below).
  2. Minimum/Maximum Read Length: Adjust based on the amplicon region (V1-V8 vs. V1-V9) and the expected sequence quality. The minimum length should range from ~800 bases (archaea-enriched or low-quality samples) to ~1200 bases for single-gene sequencing, and be ~3500 bases for full-operon sequencing.
  3. Minimum Read Quality: Recommended thresholds based on data source:
    • High quality data (expected from Cmbio): Q-score ≥ 20
    • Medium quality data: Q-score 17-20
    • Lower quality data: Q-score 15-17
  4. Alignment Preset Settings: Choose the sequencing platform used to produce the data, either ONT or PacBio.
  5. Maximum EM Iterations: 20 is recommended. Higher values rarely improve precision significantly and increase run time, although 30 to 40 iterations can help with complex communities such as soil samples.
  6. Probability Cutoff: Reduces false-positive hits caused by poor primary alignments. Values that are too high lead to overestimation of the most abundant species, so high cutoffs are appropriate only for simple, well-defined communities whose taxa are phylogenetically distant (e.g., mock communities).
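
The read length and read quality filters described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names are hypothetical, Phred+33 encoding is assumed, and the mean quality is assumed to be computed by averaging per-base error probabilities (a common convention, but the pipeline's exact method is not stated here).

```python
import math


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality, averaged in error-probability space.

    Assumption: Phred+33 encoded quality string (standard FASTQ).
    """
    if not quality:
        return 0.0
    # Convert each quality character to its error probability.
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    mean_p = sum(probs) / len(probs)
    # Convert the mean error probability back to a Phred score.
    return -10 * math.log10(mean_p)


def passes_filters(seq: str, quality: str,
                   min_len: int = 1200, max_len: int = 2500,
                   min_q: float = 20.0) -> bool:
    """Keep a read only if its length and mean quality are within bounds.

    The defaults are illustrative values for a high-quality V1-V9 16S run.
    """
    return min_len <= len(seq) <= max_len and mean_phred(quality) >= min_q
```

Averaging in error-probability space penalizes low-quality bases more than a plain arithmetic mean of Q-scores would: a read with equal numbers of Q40 and Q20 bases averages to roughly Q23 under this convention, not Q30.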
The choice of reference database is covered in more detail in the next section. Suggested parameter combinations for the most common use cases are:
| Type of Sequences | HQ 16S bacterial V1-V8 | HQ 16S bacterial V1-V9 | MQ 16S bacterial V1-V9 | HQ 18S eukaryotic V4-V9 | HQ ITS Fungal ITS1 / ITS2 | Full Operon 16S-ITS-23S rRNA operon |
| --- | --- | --- | --- | --- | --- | --- |
| HUB Workflow Name | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read ITS Amplicon Profiling | Full Length rRNA Amplicon (16S-ITS-23S) Profiling (Beta) |
| Min. read length | 1100 | 1200 | 1000 | 500 | 100 | 2500 |
| Max. read length | 2000 | 2500 | 10000 | 2000 | 5000 | 10000 |
| Min. read quality | 20 | 20 | 17 | 20 | 20 | 20 |
| Max. EM iterations | 20 | 20 | 25 | 25 | 25 | 20 |
| Probability cutoff | 0.95 | 0.95 | 0.9 | 0.9 | 0.9 | 0.9 |
| Reference DB | GTDB | GTDB | GTDB | SILVA | UNITE | MIrROR |

Database Selection

Emu supports multiple curated databases, each optimized for specific sample types and research applications. Most users should use GTDB + Greengenes2 as the default combination.
Available Databases
  • SILVA r138.1: Comprehensive, manually curated database ideal for environmental samples
  • GTDB r220: Genome Taxonomy Database with standardized prokaryotic taxonomy
  • Greengenes2 202210: Updated classification system with improved phylogenetic framework. The newer 202409 release is not used because of problems with important taxonomies, such as the lack of reference sequences labelled as Escherichia coli (https://github.com/biocore/q2-greengenes2/issues/29)
  • HOMD v15.23: Specialized for oral microbiome studies
  • MiDAS v5.3: Optimized for activated sludge and wastewater treatment systems
  • UNITE v10.0: Fungal ITS sequences for multi-kingdom studies
  • MIrROR v2.0: Ribosomal RNA gene database (16S-ITS-23S rRNA operon sequences extracted from bacterial genomes)
  • Emu Default v3.4.5: Curated combination of rrnDB v5.6 and NCBI 16S RefSeq

Sample-Specific Database Recommendations

| Sample Type/Environment | Recommended Database(s) |
| --- | --- |
| Human gut microbiome | GTDB, Greengenes2, eHOMDB (for oral-associated strains) |
| Human skin microbiome | GTDB, Greengenes2 |
| Human vaginal microbiome | GTDB, Greengenes2 |
| Oral microbiome | HOMD, SILVA |
| Soil/Environmental samples | SILVA, GTDB, Greengenes2 |
| Aquatic (marine/freshwater) | SILVA, GTDB, Greengenes2 |
| Activated sludge/wastewater | MiDAS |
| Fungal-dominated samples | UNITE (for ITS), SILVA (for multi-kingdom) |
| Poorly characterized biomes | Combine GTDB + SILVA or GTDB + Greengenes2 |
| Fast analysis of simple communities | Emu default |

Key Benefits Over Short-Read Approaches

  • Species-Level Resolution: Full-length gene coverage enables discrimination between highly similar species
  • Reduced Misclassification: Complete sequence information minimizes ambiguous taxonomic assignment
  • Enhanced Phylogenetic Resolution: Critical for distinguishing closely related species in complex microbiomes
  • Improved Rare Taxa Detection: Better sensitivity for low-abundance community members

Core Methodology

The Emu workflow implements a sophisticated two-stage process optimized for both high-quality and error-prone long-read data:

Stage 1: Alignment Generation

The pipeline begins by generating alignments between input reads and the supplied reference database using minimap2, a versatile pairwise aligner optimized for long sequences. Minimap2 employs a seed-chain-align procedure that:
  • Collects minimizers from reference sequences and indexes them in a hash table
  • Identifies exact matches (anchors) between query and reference sequences
  • Chains collinear anchors and applies dynamic programming for base-level alignment
  • Accounts for the error profiles specific to ONT and PacBio sequencing platforms
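
To make the seeding steps above more concrete, here is a simplified Python sketch of minimizer collection, hash-table indexing, and anchor lookup. It is an illustration only and not minimap2's implementation: it orders k-mers lexicographically instead of hashing them, ignores the reverse strand, and skips chaining and base-level alignment. The function names and the small `k` and `w` values are hypothetical.

```python
from collections import defaultdict


def minimizers(seq: str, k: int = 5, w: int = 4) -> set:
    """Return (kmer, position) pairs chosen as window minimizers.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (ties go to the leftmost position).
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def build_index(references: dict, k: int = 5, w: int = 4) -> dict:
    """Hash table mapping each minimizer k-mer to (reference, position)."""
    index = defaultdict(list)
    for name, seq in references.items():
        for kmer, pos in minimizers(seq, k, w):
            index[kmer].append((name, pos))
    return index


def find_anchors(query: str, index: dict, k: int = 5, w: int = 4) -> list:
    """Exact minimizer matches as (reference, query_pos, reference_pos)."""
    anchors = []
    for kmer, qpos in minimizers(query, k, w):
        for name, rpos in index.get(kmer, []):
            anchors.append((name, qpos, rpos))
    return sorted(anchors)
```

Anchors that share a reference and lie on roughly the same diagonal (similar `reference_pos - query_pos`) are the ones a chaining step would group together before base-level alignment.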

Stage 2: Expectation-Maximization (EM) Error Correction

The core innovation of Emu lies in its EM-based error correction algorithm:

Initial Probabilities:

Iterative Refinement

Final Processing
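
The three steps named above (initial probabilities, iterative refinement, final processing) can be sketched as a generic expectation-maximization loop for abundance estimation. This is a simplified illustration under stated assumptions, not Emu's or Cosmos-Hub's actual code: the per-read likelihoods `P(read | taxon)` are assumed to be given, the starting abundances are assumed uniform, and the function name, `tol`, and `min_abundance` are hypothetical. `max_iter` plays the role of the Maximum EM Iterations parameter.

```python
import math


def em_abundance(read_likelihoods, max_iter=20, tol=1e-6, min_abundance=0.0):
    """Estimate relative abundances from per-read taxon likelihoods.

    read_likelihoods: list of dicts, one per read, mapping
    taxon -> P(read | taxon) for each candidate alignment.
    """
    taxa = sorted({t for read in read_likelihoods for t in read})
    if not taxa:
        return {}
    # Initial probabilities: every taxon starts equally abundant (assumption).
    abundance = {t: 1.0 / len(taxa) for t in taxa}
    prev_ll = -math.inf
    # Iterative refinement: alternate E-step and M-step.
    for _ in range(max_iter):
        totals = dict.fromkeys(taxa, 0.0)
        log_likelihood = 0.0
        for read in read_likelihoods:
            # E-step: posterior P(taxon | read) given current abundances.
            weights = {t: p * abundance[t] for t, p in read.items()}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # read cannot be explained by any taxon
            log_likelihood += math.log(norm)
            for t, wgt in weights.items():
                totals[t] += wgt / norm
        assigned = sum(totals.values())
        if assigned == 0.0:
            break
        # M-step: abundance is the mean posterior over assigned reads.
        abundance = {t: totals[t] / assigned for t in taxa}
        if log_likelihood - prev_ll < tol:
            break  # converged
        prev_ll = log_likelihood
    # Final processing: drop low-abundance taxa and renormalize.
    kept = {t: a for t, a in abundance.items() if a >= min_abundance}
    total = sum(kept.values())
    return {t: a / total for t, a in kept.items()} if total else {}
```

In this sketch, reads that align equally well to two taxa are gradually reassigned to whichever taxon is better supported by unambiguous reads, which is how an EM approach can suppress false positives that arise from ambiguous alignments.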

Performance Benchmarking

Extensive validation studies demonstrate Emu’s superior performance compared to alternative long-read taxonomic profilers:
  • L1-norm Error: Consistently lowest relative abundance error across test datasets
  • Precision/Recall: Optimal balance between true positive detection and false positive control
  • Species Detection: Superior identification of closely related species (e.g., 97% ANI discrimination)
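
For readers who want to compute comparable metrics on their own mock-community data, the following Python sketch shows one common way to define them. The exact definitions used in the benchmarks above are not given here, so these formulas are assumptions: L1-norm error as the sum of absolute differences in relative abundance, and precision/recall on taxon presence. The function names are hypothetical.

```python
def l1_error(estimated: dict, truth: dict) -> float:
    """Sum of absolute relative-abundance differences over all taxa."""
    taxa = set(estimated) | set(truth)
    return sum(abs(estimated.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


def precision_recall(estimated: dict, truth: dict) -> tuple:
    """Presence/absence precision and recall of the reported taxa."""
    called = {t for t, a in estimated.items() if a > 0}
    present = {t for t, a in truth.items() if a > 0}
    true_positives = len(called & present)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(present) if present else 0.0
    return precision, recall
```

For example, a profile that reports both expected species plus one spurious species would have perfect recall but a precision of two thirds under these definitions.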

Data Upload and Analysis

Supported Input Formats

  • Raw FASTA and FASTQ files, optionally compressed (supported extensions: fasta, fastq, fastq.gz, fastq.bz2, fq, fq.gz)
  • Demultiplexed long-read sequences
  • Both single-end and paired-end data

Sample Assessment Features

  • Real-time read quality visualization
  • Sequence length distribution analysis
  • Taxonomic assignment confidence scoring
  • Comparative analysis with other samples

Results Format

  • Abundance profiles at seven taxonomic levels (down to species)
  • Taxonomic classification tables
  • Quality control metrics
  • Interactive visualization dashboards

Comparative Analysis Integration:

Results integrate with Cosmos-Hub’s comparative analysis tools for multi-sample comparison and statistical analysis. Data is exportable in standard formats along with publication-ready visualizations.

Methodology publications and validation studies

  • Curry, K.D., et al. “Emu: Species-Level Microbial Community Profiling for Full-Length Nanopore 16S Reads.” Nature Methods (2022)
  • Li, H. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics (2018)

Applications and Use Cases

Microbiome Research

  • Species-level community profiling
  • Functional pathway prediction
  • Longitudinal microbiome dynamics
  • Multi-kingdom microbiome analysis

Clinical Microbiology

  • Rapid pathogen identification from clinical specimens
  • Outbreak investigation and epidemiological studies

Environmental Microbiology

  • Biodiversity assessment
  • Ecosystem health monitoring
  • Biogeochemical cycle studies
  • Contamination source tracking

FAQs

What does “Unmapped species” mean in the reported profile?

What does “Unclassified species” mean in the reported profile?