The Long-Read Amplicon Pipeline on Cosmos-Hub is an Emu-based workflow that delivers species-level microbial community profiling from full-length 16S/18S/ITS rRNA gene sequences generated by long-read sequencing platforms such as PacBio and Oxford Nanopore Technologies (ONT). Unlike traditional short-read 16S approaches, which target only a subset of hypervariable regions (e.g. V3-V4), long-read amplicon sequencing captures the entire ~1,500 bp 16S gene, including all nine hypervariable regions (V1-V9). This full coverage enables species- and subspecies-level resolution that short-read methods cannot achieve.

How does Cosmos-Hub’s Emu implementation differ from the standard Emu pipeline?

The Cosmos-Hub implementation includes several enhancements over the standard Emu parameters:

Optimal performance across diverse sample types through improved fine tuning of minimap2 parameters
Enhanced EM Configuration via probability and iteration parameters for improved convergence and reduced false positives
Multi-database support with parallel processing across specialized reference databases
End-to-end automation from raw reads to publication-ready taxonomic profiles
Advanced quality control with comprehensive filtering statistics and visualizations
Parameter-rich customization tailored to your specific study requirements

Required Parameters

There are a few preprocessing parameters that are required to run the long-read amplicon pipeline:

Database Selection: select the database most appropriate for your sample type (up to 2, see below for definitions of databases)
Minimum/Maximum Read Length: Should be adjusted based on amplicon region (V1-V8 vs V1-V9) and expected sequence quality. Minimum length should range between ~800 (archaea-enriched or low-quality sequenced samples) and ~1200 bases for single gene sequencing, and ~3500 for full operon sequencing.
Minimum Read Quality: Recommended thresholds based on data source:
- High quality data (expected from Cmbio): Q-score ≥ 20
- Medium quality data: Q-score 17-20
- Lower quality data: Q-score 15-17
Alignment Preset Settings: Choose the sequencing platform used to produce the data, either ONT or PacBio.
Maximum EM Iterations: The recommended value is 20. Higher values rarely improve precision significantly and lengthen run times, although 30 or 40 iterations can help with complex communities such as soil samples.
Probability cutoff: Reduces false positive hits caused by poor primary alignments. Values that are too high lead to overestimation of the most abundant species, so high values are appropriate only for simple, well-defined communities of phylogenetically distant taxa (e.g. mock communities).
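
As an illustration of how the read length and read quality thresholds act on a single read, here is a minimal Python sketch. The function names and the choice to average per-base error probabilities (rather than raw Phred scores) are assumptions made for this example; the source does not state how the pipeline computes a read's quality.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean Q-score of a read, averaged over per-base error probabilities.

    Assumes Phred+33 encoded FASTQ qualities (an assumption of this sketch).
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filters(sequence, quality_string, min_len=1200, max_len=2500, min_q=20):
    """Return True if the read satisfies the length and quality thresholds."""
    if not (min_len <= len(sequence) <= max_len):
        return False
    return mean_qscore(quality_string) >= min_q
```

The default arguments mirror the HQ 16S V1-V9 settings suggested in this article; a read of 1,500 bases with uniformly high base qualities would pass, while a 900-base read would be rejected on length alone.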

More details on the choice of reference database are given in the next section. Here we suggest parameter combinations for the most common use cases:

Type of Sequences	HQ 16S bacterial V1-V8	HQ 16S bacterial V1-V9	MQ 16S bacterial V1-V9	HQ 18S eukaryotic V4-V9	HQ ITS Fungal ITS1 / ITS2	Full Operon 16S-ITS-23S rRNA operon
HUB Workflow Name	Long Read 16S/18S Amplicon Profiling	Long Read 16S/18S Amplicon Profiling	Long Read 16S/18S Amplicon Profiling	Long Read 16S/18S Amplicon Profiling	Long Read ITS Amplicon Profiling	Full Length rRNA Amplicon (16S-ITS-23S) Profiling (Beta)
Min. read length	1100	1200	1000	500	100	2500
Max. read length	2000	2500	10000	2000	5000	10000
Min. read quality	20	20	17	20	20	20
Max. EM iterations	20	20	25	25	25	20
Probability cutoff	0.95	0.95	0.9	0.9	0.9	0.9
Reference DB	GTDB	GTDB	GTDB	SILVA	UNITE	MIrROR
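
For users who track their settings in scripts, the table above can be captured as a small lookup structure. This is only a sketch: the `Preset` class, its field names, and the dictionary keys are hypothetical and are not part of the Cosmos-Hub interface; the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """One column of the recommended-parameters table (hypothetical structure)."""
    min_read_length: int
    max_read_length: int
    min_read_quality: int
    max_em_iterations: int
    probability_cutoff: float
    reference_db: str


# Keys are shorthand labels chosen for this example.
PRESETS = {
    "HQ 16S V1-V8": Preset(1100, 2000, 20, 20, 0.95, "GTDB"),
    "HQ 16S V1-V9": Preset(1200, 2500, 20, 20, 0.95, "GTDB"),
    "MQ 16S V1-V9": Preset(1000, 10000, 17, 25, 0.9, "GTDB"),
    "HQ 18S V4-V9": Preset(500, 2000, 20, 25, 0.9, "SILVA"),
    "HQ ITS": Preset(100, 5000, 20, 25, 0.9, "UNITE"),
    "Full operon": Preset(2500, 10000, 20, 20, 0.9, "MIrROR"),
}
```

Looking up `PRESETS["MQ 16S V1-V9"]` then gives the relaxed quality threshold (17) and the higher iteration count (25) suggested for medium-quality data.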

Why no primer trimming or barcode/adapter removal?

Long-Read Amplicon Profiling accepts raw FASTA/FASTQ files without built-in primer trimming, barcode demultiplexing, or adapter removal, unlike short-read 16S pipelines. Short-read methods require primer trimming for DADA2-based ASV reconstruction and optimal taxa assignment, but long-read mapping (minimap2 + EM error correction) gains no accuracy benefit from it. Users should pre-process barcodes/adapters externally if needed; library kit choice typically does not affect downstream mapping as long as reads meet quality/length thresholds.

Database Selection

Emu supports multiple curated databases, each optimized for specific sample types and research applications:

Most users should use GTDB + Greengenes2 as the default combination.

Available Databases

SILVA r138.1: Comprehensive, manually curated database ideal for environmental samples
GTDB r220: Genome Taxonomy Database with standardized prokaryotic taxonomy
Greengenes2 202210: Updated classification system with improved phylogenetic framework. The newer 202409 release is not used because of problems with important taxonomies, such as the lack of reference sequences labelled as Escherichia coli (https://github.com/biocore/q2-greengenes2/issues/29)
HOMD v15.23: Specialized for oral microbiome studies
MiDAS v5.3: Optimized for activated sludge and wastewater treatment systems
UNITE v10.0: Fungal ITS sequences for multi-kingdom studies
MIrROR v2.0: Ribosomal RNA gene database (16S-ITS-23S rRNA operon sequences extracted from bacterial genomes)
Emu Default v3.4.5: Curated combination of rrnDB v5.6 and NCBI 16S RefSeq

Sample-Specific Database Recommendations

Sample Type/Environment	Recommended Database(s)
Human gut microbiome	GTDB, Greengenes2, HOMD (for oral-associated strains)
Human skin microbiome	GTDB, Greengenes2
Human vaginal microbiome	GTDB, Greengenes2
Oral microbiome	HOMD, SILVA
Soil/Environmental samples	SILVA, GTDB, Greengenes2
Aquatic (marine/freshwater)	SILVA, GTDB, Greengenes2
Activated sludge/wastewater	MiDAS
Fungal-dominated samples	UNITE (for ITS), SILVA (for multi-kingdom)
Poorly characterized biomes	Combine GTDB + SILVA or GTDB + Greengenes2
Fast analysis of simple communities	Emu default

Key Benefits Over Short-Read Approaches

Species-Level Resolution: Full-length gene coverage enables discrimination between highly similar species
Reduced Misclassification: Complete sequence information minimizes ambiguous taxonomic assignment
Enhanced Phylogenetic Resolution: Critical for distinguishing closely related species in complex microbiomes
Improved Rare Taxa Detection: Better sensitivity for low-abundance community members

Core Methodology

The Emu workflow implements a sophisticated two-stage process optimized for both high-quality and error-prone long-read data:

Stage 1: Alignment Generation

The pipeline begins by generating alignments between input reads and the supplied reference database using minimap2, a versatile pairwise aligner optimized for long sequences. Minimap2 employs a seed-chain-align procedure that:

Collects minimizers from reference sequences and indexes them in a hash table
Identifies exact matches (anchors) between query and reference sequences
Chains collinear anchors and applies dynamic programming for base-level alignment
Accounts for the error profiles specific to ONT and PacBio sequencing platforms
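
The first two steps (collecting minimizers, indexing them, and finding anchors) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not minimap2's implementation: it orders k-mers lexicographically and ignores the reverse strand, whereas minimap2 uses hashed k-mers on both strands. All function names and the small `k` and `w` values are assumptions for the example.

```python
from collections import defaultdict


def minimizers(seq, k=5, w=4):
    """Return sorted (position, kmer) pairs: the smallest k-mer in every
    window of w consecutive k-mers (leftmost wins on ties)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((start + offset, window[offset]))
    return sorted(picked)


def build_index(references, k=5, w=4):
    """Hash table mapping each minimizer k-mer to its (reference, position) hits."""
    index = defaultdict(list)
    for name, seq in references.items():
        for pos, kmer in minimizers(seq, k, w):
            index[kmer].append((name, pos))
    return index


def find_anchors(query, index, k=5, w=4):
    """Exact minimizer matches between the query and the indexed references."""
    anchors = []
    for qpos, kmer in minimizers(query, k, w):
        for name, rpos in index.get(kmer, []):
            anchors.append((name, qpos, rpos))
    return anchors
```

Anchors that share the same difference between reference and query position lie on one diagonal; chaining such collinear anchors is what narrows the region passed on to base-level dynamic programming.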

Stage 2: Expectation-Maximization (EM) Error Correction

The core innovation of Emu lies in its EM-based error correction algorithm:

Initial Probabilities:

Establishes alignment likelihoods P(r|t) between each read r and each taxon t in the reference database
Calculates probabilities for nucleotide alignment types: mismatch (X), insertion (I), deletion (D), and softclip (S)
Initializes sample composition vector F with uniform distribution

Iterative Refinement

E-step (Expectation): Computes the probability that each read originated from each species P(t|r) using Bayes’ theorem
M-step (Maximization): Updates species abundance estimates F(t) based on read assignment probabilities
Evaluates total log likelihood L(R) and continues iteration until convergence (improvement < 0.01)
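
The iteration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Emu's code: the alignment likelihoods P(r|t) are taken as given (one dictionary per read), the probability cutoff is not modeled, and the function name and input layout are hypothetical. The defaults of 20 iterations and a 0.01 log-likelihood improvement follow the values quoted in this article.

```python
import math


def em_abundance(likelihoods, max_iterations=20, tolerance=0.01):
    """Estimate taxon abundances F(t) from per-read likelihoods.

    likelihoods: list with one dict per read, mapping taxon -> P(r|t).
    """
    taxa = sorted({t for read in likelihoods for t in read})
    if not taxa:
        return {}
    f = {t: 1.0 / len(taxa) for t in taxa}  # uniform initial composition
    prev_ll = -math.inf
    for _ in range(max_iterations):
        counts = dict.fromkeys(taxa, 0.0)
        ll = 0.0
        for read in likelihoods:
            # E-step: P(t|r) is proportional to P(r|t) * F(t) (Bayes' theorem)
            joint = {t: p * f[t] for t, p in read.items()}
            total = sum(joint.values())
            if total == 0:
                continue  # read carries no information under the current F
            ll += math.log(total)
            for t, value in joint.items():
                counts[t] += value / total
        n = sum(counts.values())
        if n == 0:
            break
        # M-step: F(t) becomes the mean posterior assignment over reads
        f = {t: c / n for t, c in counts.items()}
        if ll - prev_ll < tolerance:  # converged on total log likelihood
            break
        prev_ll = ll
    return f
```

With eight reads that align only to taxon A and two reads that align equally well to A and B, the ambiguous reads are progressively reassigned to A, because the abundance estimate from the unambiguous reads makes A the more probable origin.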

Final Processing

Applies abundance threshold filtering to remove false positives (default: 1 read for samples with fewer than 1,000 reads, 10 reads for larger samples)
Performs final redistribution to generate species-level abundance profiles
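
A rough sketch of the threshold step, assuming the read-count threshold is converted into a relative-abundance cutoff by dividing by the number of reads in the sample. The plain renormalization at the end is a stand-in chosen for this example and should not be read as Emu's actual redistribution procedure; the function name is hypothetical.

```python
def filter_low_abundance(abundance, n_reads):
    """Drop taxa whose estimated read support is below the default threshold.

    abundance: dict mapping taxon -> relative abundance (sums to 1).
    n_reads:   number of reads in the sample.
    """
    if n_reads <= 0:
        return {}
    min_reads = 1 if n_reads < 1000 else 10  # defaults quoted in the text
    cutoff = min_reads / n_reads
    kept = {t: a for t, a in abundance.items() if a >= cutoff}
    total = sum(kept.values())
    # Simple renormalization (assumption of this sketch)
    return {t: a / total for t, a in kept.items()} if total else {}
```

For example, a taxon at 0.5% relative abundance is kept in a 5,000-read sample (about 25 reads) but dropped in a 1,000-read sample (about 5 reads, below the 10-read threshold).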

Performance Benchmarking

Extensive validation studies demonstrate Emu’s superior performance compared to alternative long-read taxonomic profilers:

L1-norm Error: Consistently lowest relative abundance error across test datasets
Precision/Recall: Optimal balance between true positive detection and false positive control
Species Detection: Superior identification of closely related species (e.g., 97% ANI discrimination)
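
For readers who want to compute these metrics on their own mock-community data, the following sketch shows one common way to define them. The function names are hypothetical, and treating any taxon with non-zero abundance as "detected" is an assumption of this example rather than a definition taken from the validation studies.

```python
def l1_error(expected, observed):
    """Sum of absolute differences in relative abundance over all taxa."""
    taxa = set(expected) | set(observed)
    return sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in taxa)


def precision_recall(expected, observed):
    """Precision and recall of taxon detection (presence/absence only)."""
    truth = {t for t, a in expected.items() if a > 0}
    called = {t for t, a in observed.items() if a > 0}
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

A profile that reports every expected species plus one spurious species has perfect recall but reduced precision, and the spurious species' abundance also contributes to the L1-norm error.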

Data Upload and Analysis

Supported Input Formats

Raw FASTA and FASTQ files, including compressed files (supported extensions: fastq, fastq.gz, fastq.bz2, fasta, fq, fq.gz)
Demultiplexed long-read sequences
Both single-end and paired-end data

Sample Assessment Features

Real-time read quality visualization
Sequence length distribution analysis
Taxonomic assignment confidence scoring
Comparative analysis with other samples

Results Format

Abundance profiles at seven taxonomic levels (down to species)
Taxonomic classification tables
Quality control metrics
Interactive visualization dashboards

Comparative Analysis Integration:

Results integrate with Cosmos-Hub’s comparative analysis tools for multi-sample comparison and statistical analysis. Data can be exported in standard formats along with publication-ready visualizations.

Methodology publications and validation studies

Curry, K.D., et al. “Emu: Species-Level Microbial Community Profiling for Full-Length Nanopore 16S Reads.” Nature Methods (2022)
Li, H. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics (2018)

Applications and Use Cases

Microbiome Research

Species-level community profiling
Functional pathway prediction
Longitudinal microbiome dynamics
Multi-kingdom microbiome analysis

Clinical Microbiology

Rapid pathogen identification from clinical specimens
Outbreak investigation and epidemiological studies

Environmental Microbiology

Biodiversity assessment
Ecosystem health monitoring
Biogeochemical cycle studies
Contamination source tracking

FAQs

What does “Unmapped species” mean in the reported profile?

“Unmapped species” represents the portion of the sample that the workflow could not recognize in the reference. More precisely, these are the sequenced reads that have no successful alignment (i.e. no alignment passing the quality thresholds) against the database sequences.

What does “Unclassified species” mean in the reported profile?

“Unclassified species”, by contrast, covers reads assigned to species that are likely not present in the sample. This is a feature specific to Emu, which defines “Unclassified mapped reads” as reads that mapped only to database sequences of species presumed to be absent (likely because of their low overall abundance).
