The Long-Read Amplicon Pipeline on Cosmos-Hub is an Emu-based workflow that delivers species-level microbial community profiling from full-length 16S/18S/ITS rRNA gene sequences generated by long-read sequencing platforms such as PacBio and Oxford Nanopore Technologies (ONT). Unlike traditional short-read 16S approaches that target only a subset of hypervariable regions (e.g., V3-V4), long-read amplicon sequencing captures the entire ~1,500 bp 16S gene, including all nine hypervariable regions (V1-V9). This full-length coverage enables species- and subspecies-level resolution that short-read methods generally cannot achieve.

How does Cosmos-Hub’s Emu implementation differ from the standard Emu pipeline?

The Cosmos-Hub implementation includes several enhancements over the standard Emu configuration:
  • Optimized alignment through fine-tuned minimap2 parameters for robust performance across diverse sample types
  • Enhanced EM configuration via probability and iteration parameters for improved convergence and fewer false positives
  • Multi-database support with parallel processing across specialized reference databases
  • End-to-end automation from raw reads to publication-ready taxonomic profiles
  • Advanced quality control with comprehensive filtering statistics and visualizations
  • Parameter-rich customization tailored to your specific study requirements

Required Parameters

A few preprocessing parameters are required to run the long-read amplicon pipeline:
  1. Database Selection: Select the database(s) most appropriate for your sample type (up to two; see the Database Selection section below for definitions).
  2. Minimum/Maximum Read Length: Adjust based on the amplicon region (V1-V8 vs. V1-V9) and the expected sequence quality. For single-gene sequencing, the minimum length should range between ~800 bases (archaea-enriched or low-quality samples) and ~1200 bases; for full-operon sequencing, use ~3500 bases.
  3. Minimum Read Quality: Recommended thresholds based on data source:
    • High quality data (expected from Cmbio): Q-score ≥ 20
    • Medium quality data: Q-score 17-20
    • Lower quality data: Q-score 15-17
  4. Alignment Preset Settings: Choose the sequencing platform that produced the data (ONT or PacBio).
  5. Maximum EM Iterations: The recommended value is 20. In most cases, higher values do not significantly improve precision and only lengthen run time. Higher values (e.g., 30 or 40) can be beneficial for complex communities such as soil samples.
  6. Probability Cutoff: Reduces false-positive hits caused by poor primary alignments. Values that are too high lead to overestimation of the most abundant species; they are appropriate only for simple, well-defined communities whose taxa are phylogenetically distant (e.g., mock communities).
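
The read length and read quality thresholds above can be illustrated with a minimal sketch. It is not the pipeline's actual preprocessing code: the function names are hypothetical, the FASTQ input is assumed to use plain 4-line records with Phred+33 encoding, and the mean Q-score is computed by averaging per-base error probabilities, which is one common convention and may differ from what the workflow does internally.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean read quality, averaged over per-base error probabilities.

    Assumption: Phred+33 encoded quality string (standard for modern FASTQ).
    """
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_filters(sequence, quality_string, min_len=1200, max_len=2500, min_q=20):
    """Return True if the read satisfies the length and quality thresholds."""
    if not (min_len <= len(sequence) <= max_len):
        return False
    return mean_qscore(quality_string) >= min_q


def filter_fastq(lines, **thresholds):
    """Yield (header, sequence, quality) for reads that pass the filters.

    Assumption: `lines` holds complete 4-line FASTQ records.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, sequence, _, quality = records[i:i + 4]
        if passes_filters(sequence, quality, **thresholds):
            yield header, sequence, quality
```

The defaults in this sketch (1200 to 2500 bases, Q-score ≥ 20) are only illustrative; other amplicon regions or data qualities would pass different values, for example `filter_fastq(lines, min_len=800, min_q=17)`.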
More details on the choice of reference database are given in the next section. The recommended parameter combinations for the most common use cases are:
| Type of Sequences | HQ 16S bacterial V1-V8 | HQ 16S bacterial V1-V9 | MQ 16S bacterial V1-V9 | HQ 18S eukaryotic V4-V9 | HQ ITS Fungal ITS1 / ITS2 | Full Operon 16S-ITS-23S rRNA operon |
|---|---|---|---|---|---|---|
| HUB Workflow Name | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read ITS Amplicon Profiling | Full Length rRNA Amplicon (16S-ITS-23S) Profiling (Beta) |
| Min. read length | 1100 | 1200 | 1000 | 500 | 100 | 2500 |
| Max. read length | 2000 | 2500 | 10000 | 2000 | 5000 | 10000 |
| Min. read quality | 20 | 20 | 17 | 20 | 20 | 20 |
| Max. EM iterations | 20 | 20 | 25 | 25 | 25 | 20 |
| Probability cutoff | 0.95 | 0.95 | 0.9 | 0.9 | 0.9 | 0.9 |
| Reference DB | GTDB | GTDB | GTDB | SILVA | UNITE | MIrROR |

Database Selection

Emu supports multiple curated databases, each optimized for specific sample types and research applications.
Most users should use GTDB + Greengenes2 as the default combination.
Available Databases
  • SILVA r138.1: Comprehensive, manually curated database ideal for environmental samples
  • GTDB r220: Genome Taxonomy Database with standardized prokaryotic taxonomy
  • Greengenes2 202210: Updated classification system with an improved phylogenetic framework. The newer 202409 release is not used because of problems with important taxa, such as the lack of reference sequences labelled as Escherichia coli (see https://github.com/biocore/q2-greengenes2/issues/29)
  • HOMD v15.23: Specialized for oral microbiome studies
  • MiDAS v5.3: Optimized for activated sludge and wastewater treatment systems
  • UNITE v10.0: Fungal ITS sequences for multi-kingdom studies
  • MIrROR v2.0: Ribosomal RNA gene database (16S-ITS-23S rRNA operon sequences extracted from bacterial genomes)
  • Emu Default v3.4.5: Curated combination of rrnDB v5.6 and NCBI 16S RefSeq
Sample-Specific Database Recommendations
| Sample Type/Environment | Recommended Database(s) |
|---|---|
| Human gut microbiome | GTDB, Greengenes2, HOMD (for oral-associated strains) |
| Human skin microbiome | GTDB, Greengenes2 |
| Human vaginal microbiome | GTDB, Greengenes2 |
| Oral microbiome | HOMD, SILVA |
| Soil/Environmental samples | SILVA, GTDB, Greengenes2 |
| Aquatic (marine/freshwater) | SILVA, GTDB, Greengenes2 |
| Activated sludge/wastewater | MiDAS |
| Fungal-dominated samples | UNITE (for ITS), SILVA (for multi-kingdom) |
| Poorly characterized biomes | Combine GTDB + SILVA or GTDB + Greengenes2 |
| Fast analysis of simple communities | Emu default |

Core Methodology

The Emu workflow implements a sophisticated two-stage process optimized for both high-quality and error-prone long-read data:
  • Stage 1: Alignment Generation
    • The pipeline begins by generating alignments between input reads and the supplied reference database using minimap2, a versatile pairwise aligner optimized for long sequences. Minimap2 employs a seed-chain-align procedure that:
      • Collects minimizers from reference sequences and indexes them in a hash table
      • Identifies exact matches (anchors) between query and reference sequences
      • Chains collinear anchors and applies dynamic programming for base-level alignment
      • Accounts for the error profiles specific to ONT and PacBio sequencing platforms
  • Stage 2: Expectation-Maximization (EM) Error Correction
    • The core innovation of Emu lies in its EM-based error correction algorithm:
    • Initial Probabilities:
      • Establishes alignment likelihoods P(r|t) between each read r and each taxon t in the reference database
      • Calculates probabilities for nucleotide alignment types: mismatch (X), insertion (I), deletion (D), and softclip (S)
      • Initializes sample composition vector F with uniform distribution
    • Iterative Refinement:
      • E-step (Expectation): Computes the probability P(t|r) that each read r originated from each species t, using Bayes’ theorem
      • M-step (Maximization): Updates species abundance estimates F(t) based on read assignment probabilities
      • Evaluates total log likelihood L(R) and continues iteration until convergence (improvement < 0.01)
    • Final Processing:
      • Applies abundance threshold filtering to remove false positives (default minimum: 1 read for samples with fewer than 1,000 reads, 10 reads for larger samples)
      • Performs final redistribution to generate species-level abundance profiles
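
The EM stage described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Emu's actual code: the function names are hypothetical, the alignment likelihoods P(r|t) are assumed to be already computed and passed in as one dictionary per read, and the "final redistribution" is approximated by re-running the same EM loop restricted to the taxa that survive the read-count threshold.

```python
import math


def em_abundance(read_likelihoods, taxa=None, max_iterations=20, tolerance=0.01):
    """Estimate taxon frequencies F(t) from per-read likelihoods P(r|t).

    read_likelihoods: list of dicts, one per read, mapping taxon -> P(r|t).
    taxa: optional subset of taxa to consider (used for the final pass).
    """
    if taxa is None:
        taxa = {t for read in read_likelihoods for t in read}
    taxa = set(taxa)
    if not taxa:
        return {}
    freqs = {t: 1.0 / len(taxa) for t in taxa}  # uniform initial composition F
    prev_loglik = -math.inf
    for _ in range(max_iterations):
        counts = dict.fromkeys(freqs, 0.0)
        loglik = 0.0
        for read in read_likelihoods:
            # E-step: P(t|r) is proportional to P(r|t) * F(t) (Bayes' theorem).
            joint = {t: p * freqs[t] for t, p in read.items() if t in freqs}
            total = sum(joint.values())
            if total == 0:
                continue  # read has no candidate among the considered taxa
            loglik += math.log(total)
            for t, value in joint.items():
                counts[t] += value / total
        assigned = sum(counts.values())
        if assigned == 0:
            return dict.fromkeys(freqs, 0.0)
        # M-step: new F(t) is the expected fraction of reads assigned to t.
        freqs = {t: c / assigned for t, c in counts.items()}
        # Stop once the total log likelihood improves by less than the tolerance.
        if loglik - prev_loglik < tolerance:
            break
        prev_loglik = loglik
    return freqs


def estimate_profile(read_likelihoods, min_reads=None, **em_options):
    """Run EM, drop taxa below the read-count threshold, then redistribute."""
    n_reads = len(read_likelihoods)
    if min_reads is None:
        # Thresholds quoted in the text: 1 read for <1000 reads, else 10 reads.
        min_reads = 1 if n_reads < 1000 else 10
    freqs = em_abundance(read_likelihoods, **em_options)
    kept = {t for t, f in freqs.items() if f * n_reads >= min_reads}
    return em_abundance(read_likelihoods, taxa=kept, **em_options)
```

As a toy usage example, `estimate_profile([{"A": 0.9}] * 6 + [{"A": 0.5, "B": 0.5}] * 3 + [{"B": 0.9}], min_reads=2)` starts from ten reads, most of which support a hypothetical taxon A. The ambiguous reads are progressively attributed to A, taxon B falls below the two-read threshold, and the final pass assigns all remaining mass to A.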

Performance Benchmarking

Extensive validation studies demonstrate Emu’s superior performance compared to alternative long-read taxonomic profilers:
  • L1-norm Error: Consistently lowest relative abundance error across test datasets
  • Precision/Recall: Optimal balance between true positive detection and false positive control
  • Species Detection: Superior identification of closely related species (e.g., 97% ANI discrimination)
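
For readers who want to compute these metrics on their own mock-community data, the sketch below shows one common way to define them. The function names and the example taxa are hypothetical, profiles are assumed to be dictionaries of relative abundances, and a taxon counts as detected when its estimated abundance is greater than zero.

```python
def l1_error(estimated, truth):
    """L1-norm error: sum of absolute relative-abundance differences."""
    taxa = set(estimated) | set(truth)
    return sum(abs(estimated.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


def precision_recall(estimated, truth):
    """Presence/absence precision and recall of the detected taxa."""
    detected = {t for t, a in estimated.items() if a > 0}
    expected = {t for t, a in truth.items() if a > 0}
    true_positives = len(detected & expected)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mock community: two expected taxa, one false positive reported.
truth = {"Species A": 0.5, "Species B": 0.5}
estimated = {"Species A": 0.6, "Species B": 0.3, "Species C": 0.1}
error = l1_error(estimated, truth)
precision, recall = precision_recall(estimated, truth)
```

In this toy case the L1-norm error is the sum of the three per-taxon differences (0.1 + 0.2 + 0.1), precision is 2/3 because one of the three reported taxa is a false positive, and recall is 1 because both expected taxa were detected.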

Applications and Use Cases

Clinical Microbiology
  • Rapid pathogen identification from clinical specimens
  • Outbreak investigation and epidemiological studies
Microbiome Research
  • Species-level community profiling
  • Functional pathway prediction
  • Longitudinal microbiome dynamics
  • Multi-kingdom microbiome analysis
Environmental Monitoring
  • Biodiversity assessment
  • Ecosystem health monitoring
  • Biogeochemical cycle studies
  • Contamination source tracking

Key Benefits Over Short-Read Approaches

  • Species-Level Resolution: Full-length gene coverage enables discrimination between highly similar species
  • Reduced Misclassification: Complete sequence information minimizes ambiguous taxonomic assignment
  • Enhanced Phylogenetic Resolution: Critical for distinguishing closely related species in complex microbiomes
  • Improved Rare Taxa Detection: Better sensitivity for low-abundance community members

Data Upload and Analysis

Supported Input Formats
  • Raw FASTA and FASTQ files, including compressed formats (supported extensions: fasta, fastq, fastq.gz, fastq.bz2, fq, fq.gz)
  • Demultiplexed long-read sequences
  • Both single-end and paired-end data
  • Direct upload from sequencing platforms
Quality Assessment Features
  • Real-time read quality visualization
  • Sequence length distribution analysis
  • Taxonomic assignment confidence scoring
  • Comparative analysis with other samples
Results Format
  • Abundance profiles at seven taxonomic levels (down to species)
  • Taxonomic classification tables
  • Quality control metrics
  • Interactive visualization dashboards
Comparative Analysis Integration: Results seamlessly integrate with Cosmos-Hub’s comparative analysis tools for:
  • Multi-sample comparison
  • Statistical analysis
  • Publication-ready visualizations
  • Data export in standard formats

Additional methodology papers and validation studies

  • Curry, K.D., et al. “Emu: species-level microbial community profiling of full-length 16S rRNA Oxford Nanopore sequencing data.” Nature Methods (2022)
  • Li, H. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics (2018)

FAQs

  • What does “Unmapped species” mean in the reported profile?
    “Unmapped species” represents the portion of the sample that the workflow could not recognize in the reference database. More precisely, these are the sequenced reads with no successful alignment against the database sequences (i.e., no alignment passing the quality thresholds).
  • What does “Unclassified species” mean in the reported profile?
    “Unclassified species” instead represents reads assigned to species that are likely not present in the sample. This is a feature of Emu, which defines “unclassified mapped reads” as reads that mapped only to database sequences of species presumed to be absent (likely due to low overall abundance).