
How does Cosmos-Hub’s Emu implementation differ from the standard Emu pipeline?
The Cosmos-Hub implementation includes several enhancements over the standard Emu parameters:
- Fine-tuned minimap2 parameters for better performance across diverse sample types
- Enhanced EM configuration, with probability-cutoff and iteration parameters, for improved convergence and fewer false positives
- Multi-database support with parallel processing across specialized reference databases
- End-to-end automation from raw reads to publication-ready taxonomic profiles
- Advanced quality control with comprehensive filtering statistics and visualizations
- Parameter-rich customization tailored to your specific study requirements
Required Parameters
A few preprocessing parameters are required to run the long-read amplicon pipeline:
- Database Selection: select the database most appropriate for your sample type (up to 2, see below for definitions of databases)
- Minimum/Maximum Read Length: Should be adjusted based on amplicon region (V1-V8 vs V1-V9) and expected sequence quality. For single-gene sequencing, the minimum length should range between ~800 bases (archaea-enriched or low-quality samples) and ~1200 bases; for full-operon sequencing, use ~3500 bases.
- Minimum Read Quality: Recommended thresholds based on data source:
  - High quality data (expected from Cmbio): Q-score ≥ 20
  - Medium quality data: Q-score 17-20
  - Lower quality data: Q-score 15-17
- Alignment Preset Settings: choose the sequencing platform that produced the data (ONT or PacBio)
- Maximum EM Iterations: 20 is recommended. In most cases, values above 20 do not improve precision significantly and only lengthen the run time. Higher values (e.g. 30 or 40) can be beneficial for complex communities such as soil samples.
- Probability Cutoff: reduces false positive hits caused by poor primary alignments. Values that are too high lead to overestimation of the most abundant species, so high values are best reserved for simple, well-defined communities with phylogenetically distant taxa (e.g. mock communities).
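As a rough illustration of how the read length and read quality thresholds above act on a single read, the sketch below filters by length and mean Phred quality. It is not the Cosmos-Hub implementation: the function names, the Phred+33 encoding, and the choice to average in error-probability space rather than averaging raw Q-scores are all assumptions made for the example.

```python
import math


def mean_qscore(quality: str, offset: int = 33) -> float:
    """Mean read quality from a FASTQ quality string.

    Assumption: Phred+33 encoding, with the mean taken over error
    probabilities and converted back to a Q-score.
    """
    if not quality:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filters(seq: str, quality: str,
                   min_len: int = 1200, max_len: int = 2500,
                   min_q: float = 20.0) -> bool:
    """Return True if the read satisfies the length and quality thresholds."""
    if not (min_len <= len(seq) <= max_len):
        return False
    return mean_qscore(quality) >= min_q
```

The default arguments mirror the high-quality V1-V9 values recommended in this section; a read that is too short, too long, or below the quality threshold is rejected.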
Type of Sequences | HQ 16S bacterial V1-V8 | HQ 16S bacterial V1-V9 | MQ 16S bacterial V1-V9 | HQ 18S eukaryotic V4-V9 | HQ ITS Fungal ITS1 / ITS2 | Full Operon 16S-ITS-23S rRNA operon |
---|---|---|---|---|---|---|
HUB Workflow Name | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read 16S/18S Amplicon Profiling | Long Read ITS Amplicon Profiling | Full Length rRNA Amplicon (16S-ITS-23S) Profiling (Beta) |
Min. read length | 1100 | 1200 | 1000 | 500 | 100 | 2500 |
Max. read length | 2000 | 2500 | 10000 | 2000 | 5000 | 10000 |
Min. read quality | 20 | 20 | 17 | 20 | 20 | 20 |
Max. EM iterations | 20 | 20 | 25 | 25 | 25 | 20 |
Probability cutoff | 0.95 | 0.95 | 0.9 | 0.9 | 0.9 | 0.9 |
Reference DB | GTDB | GTDB | GTDB | SILVA | UNITE | MIrROR |
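For scripted work, the recommended values in the table above can be kept in a small lookup structure. The sketch below is only one way to do that: the `Preset` class, the key names, and `get_preset` are hypothetical and are not part of Cosmos-Hub or Emu. The numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    min_len: int
    max_len: int
    min_quality: int
    max_em_iterations: int
    probability_cutoff: float
    reference_db: str


# Keys are hypothetical labels; values are copied from the table above.
PRESETS = {
    "hq_16s_v1_v8": Preset(1100, 2000, 20, 20, 0.95, "GTDB"),
    "hq_16s_v1_v9": Preset(1200, 2500, 20, 20, 0.95, "GTDB"),
    "mq_16s_v1_v9": Preset(1000, 10000, 17, 25, 0.9, "GTDB"),
    "hq_18s_v4_v9": Preset(500, 2000, 20, 25, 0.9, "SILVA"),
    "hq_its": Preset(100, 5000, 20, 25, 0.9, "UNITE"),
    "full_operon": Preset(2500, 10000, 20, 20, 0.9, "MIrROR"),
}


def get_preset(name: str) -> Preset:
    """Look up a preset by label, with a readable error for unknown names."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown preset {name!r}; choose from {sorted(PRESETS)}"
        ) from None
```

For example, `get_preset("hq_its").reference_db` gives `"UNITE"`.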
Database Selection
Emu supports multiple curated databases, each optimized for specific sample types and research applications. Most users should use GTDB + Greengenes2 as the default combination. The available databases are:
- SILVA r138.1: Comprehensive, manually curated database ideal for environmental samples
- GTDB r220: Genome Taxonomy Database with standardized prokaryotic taxonomy
- Greengenes2 202210: Updated classification system with improved phylogenetic framework (not updated to the newest version, 202409, because of problems with important taxonomies, such as the lack of reference sequences labelled as Escherichia coli; see https://github.com/biocore/q2-greengenes2/issues/29)
- HOMD v15.23: Specialized for oral microbiome studies
- MiDAS v5.3: Optimized for activated sludge and wastewater treatment systems
- UNITE v10.0: Fungal ITS sequences for multi-kingdom studies
- MIrROR v2.0: Ribosomal RNA gene database (16S-ITS-23S rRNA operon sequences extracted from bacterial genomes)
- Emu Default v3.4.5: Curated combination of rrnDB v5.6 and NCBI 16S RefSeq
Sample-Specific Database Recommendations
Sample Type/Environment | Recommended Database(s) |
---|---|
Human gut microbiome | GTDB, Greengenes2, HOMD (for oral-associated strains) |
Human skin microbiome | GTDB, Greengenes2 |
Human vaginal microbiome | GTDB, Greengenes2 |
Oral microbiome | HOMD, SILVA |
Soil/Environmental samples | SILVA, GTDB, Greengenes2 |
Aquatic (marine/freshwater) | SILVA, GTDB, Greengenes2 |
Activated sludge/wastewater | MiDAS |
Fungal-dominated samples | UNITE (for ITS), SILVA (for multi-kingdom) |
Poorly characterized biomes | Combine GTDB + SILVA or GTDB + Greengenes2 |
Fast analysis of simple communities | Emu default |
Key Benefits Over Short-Read Approaches
- Species-Level Resolution: Full-length gene coverage enables discrimination between highly similar species
- Reduced Misclassification: Complete sequence information minimizes ambiguous taxonomic assignment
- Enhanced Phylogenetic Resolution: Critical for distinguishing closely related species in complex microbiomes
- Improved Rare Taxa Detection: Better sensitivity for low-abundance community members
Core Methodology
The Emu workflow implements a two-stage process optimized for both high-quality and error-prone long-read data:
Stage 1: Alignment Generation
The pipeline begins by generating alignments between input reads and the supplied reference database using minimap2, a versatile pairwise aligner optimized for long sequences. Minimap2 employs a seed-chain-align procedure that:
- Collects minimizers from reference sequences and indexes them in a hash table
- Identifies exact matches (anchors) between query and reference sequences
- Chains collinear anchors and applies dynamic programming for base-level alignment
- Accounts for the error profiles specific to ONT and PacBio sequencing platforms
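To make the first two steps of the seed-chain-align procedure concrete, here is a toy sketch of minimizer collection, hash-table indexing, and anchor finding. It is a deliberate simplification and not minimap2's code: it orders k-mers lexicographically instead of by hash, ignores the reverse strand, and skips chaining and base-level alignment entirely. The function names and the tiny `k` and `w` values are assumptions chosen for readability.

```python
from collections import defaultdict


def minimizers(seq: str, k: int = 5, w: int = 4) -> set:
    """Return the (k-mer, position) minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (leftmost on ties).
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    picked = set()
    for start in range(max(len(kmers) - w + 1, 1)):
        picked.add(min(kmers[start:start + w]))
    return picked


def build_index(references: dict, k: int = 5, w: int = 4) -> dict:
    """Hash table mapping each minimizer k-mer to (reference name, position)."""
    index = defaultdict(list)
    for name, seq in references.items():
        for kmer, pos in minimizers(seq, k, w):
            index[kmer].append((name, pos))
    return index


def find_anchors(query: str, index: dict, k: int = 5, w: int = 4) -> list:
    """Exact minimizer matches as (reference name, reference pos, query pos)."""
    anchors = []
    for kmer, qpos in minimizers(query, k, w):
        for name, rpos in index.get(kmer, []):
            anchors.append((name, rpos, qpos))
    return sorted(anchors)
```

Anchors that share the same reference and a constant offset between reference and query positions are collinear; these are the ones a chaining step would group before base-level alignment.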
Stage 2: Expectation-Maximization (EM) Error Correction
The core innovation of Emu lies in its EM-based error correction algorithm:
Initial Probabilities:
- Establishes alignment likelihoods P(r|t) between each read r and each taxon t in the reference database
- Calculates probabilities for nucleotide alignment types: mismatch (X), insertion (I), deletion (D), and softclip (S)
- Initializes sample composition vector F with uniform distribution
Iterative Refinement
- E-step (Expectation): Computes the probability that each read originated from each species P(t|r) using Bayes’ theorem
- M-step (Maximization): Updates species abundance estimates F(t) based on read assignment probabilities
- Evaluates total log likelihood L(R) and continues iteration until convergence (improvement < 0.01)
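Written out, the three quantities in the refinement loop take the usual mixture-model form shown below. This is a sketch consistent with the description above rather than a transcription of Emu's source code; the symbols |R| (number of reads) and t' (a summation index over taxa) are notation introduced here.

```latex
% E-step: posterior probability that read r originated from taxon t
P(t \mid r) = \frac{P(r \mid t)\, F(t)}{\sum_{t'} P(r \mid t')\, F(t')}

% M-step: updated abundance of taxon t over the read set R
F(t) = \frac{1}{|R|} \sum_{r \in R} P(t \mid r)

% Total log likelihood, evaluated at each iteration
L(R) = \sum_{r \in R} \log \sum_{t} P(r \mid t)\, F(t)
```

Iteration stops once the increase in L(R) between consecutive rounds falls below 0.01.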
Final Processing
- Applies abundance threshold filtering to remove false positives (default: 1 read for <1000 reads, 10 reads for larger samples)
- Performs final redistribution to generate species-level abundance profiles
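The Stage 2 procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Emu's actual code: the function names and the input layout (one dictionary of taxon index to P(r|t) per read) are hypothetical, the read-count threshold is one interpretation of the defaults quoted above, and the "final redistribution" is approximated by simple renormalization.

```python
import math


def em_abundance(likelihoods, n_taxa, max_iter=20, tol=0.01):
    """Estimate taxon abundances F from per-read likelihoods P(r|t).

    likelihoods: list with one dict per read, mapping taxon index -> P(r|t).
    """
    f = [1.0 / n_taxa] * n_taxa          # uniform initial composition
    prev_ll = -math.inf
    for _ in range(max_iter):
        counts = [0.0] * n_taxa
        ll = 0.0
        for read in likelihoods:
            weighted = {t: p * f[t] for t, p in read.items()}
            total = sum(weighted.values())
            if total == 0.0:
                continue                 # read has no support under current F
            ll += math.log(total)
            for t, value in weighted.items():
                counts[t] += value / total   # E-step: P(t|r) via Bayes
        assigned = sum(counts)
        if assigned == 0.0:
            break
        f = [c / assigned for c in counts]   # M-step: update F(t)
        if ll - prev_ll < tol:               # log likelihood has converged
            break
        prev_ll = ll
    return f


def filter_and_renormalize(f, n_reads):
    """Zero out taxa below the read-count threshold, then renormalize."""
    min_reads = 1 if n_reads < 1000 else 10
    kept = [x if x * n_reads >= min_reads else 0.0 for x in f]
    total = sum(kept)
    return [x / total for x in kept] if total else kept
```

With six reads unique to one taxon, two unique to another, and two that match both equally well, the estimate settles near 0.75 and 0.25, since the ambiguous reads are split in proportion to the current abundances.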
Performance Benchmarking
Extensive validation studies demonstrate Emu’s superior performance compared to alternative long-read taxonomic profilers:
- L1-norm Error: Consistently lowest relative abundance error across test datasets
- Precision/Recall: Optimal balance between true positive detection and false positive control
- Species Detection: Superior identification of closely related species (e.g., 97% ANI discrimination)
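For readers who want to compute the first two metrics on their own mock-community data, the sketch below shows one common way to define them. The function names and the presence rule (any abundance above zero counts as detected) are assumptions for illustration and are not taken from the benchmarking studies.

```python
def l1_error(truth: dict, estimate: dict) -> float:
    """Sum of absolute differences between true and estimated abundances."""
    taxa = set(truth) | set(estimate)
    return sum(abs(truth.get(t, 0.0) - estimate.get(t, 0.0)) for t in taxa)


def precision_recall(truth: dict, estimate: dict) -> tuple:
    """Precision and recall of taxon detection (abundance > 0 means present)."""
    true_set = {t for t, a in truth.items() if a > 0}
    called = {t for t, a in estimate.items() if a > 0}
    true_positives = len(true_set & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall
```

A profile that reports every true species plus one spurious species has perfect recall but reduced precision, and the spurious abundance also adds to the L1-norm error.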
Data Upload and Analysis
Supported Input Formats
- Raw FASTA and FASTQ files, optionally compressed (supported extensions: fasta, fastq, fastq.gz, fastq.bz2, fq, fq.gz)
- Demultiplexed long-read sequences
- Both single-end and paired-end data
Sample Assessment Features
- Real-time read quality visualization
- Sequence length distribution analysis
- Taxonomic assignment confidence scoring
- Comparative analysis with other samples
Results Format
- Abundance profiles at 7 taxonomic levels (down to species)
- Taxonomic classification tables
- Quality control metrics
- Interactive visualization dashboards
Comparative Analysis Integration:
Results integrate with Cosmos-Hub’s comparative analysis tools for multi-sample comparison and statistical analysis. Data is exportable in standard formats along with publication-ready visualizations.
Methodology publications and validation studies
- Curry, K.D., et al. “Emu: Species-Level Microbial Community Profiling for Full-Length Nanopore 16S Reads.” Nature Methods (2022)
- Li, H. “Minimap2: pairwise alignment for nucleotide sequences.” Bioinformatics (2018)
Applications and Use Cases
Microbiome Research
- Species-level community profiling
- Functional pathway prediction
- Longitudinal microbiome dynamics
- Multi-kingdom microbiome analysis
Clinical Microbiology
- Rapid pathogen identification from clinical specimens
- Outbreak investigation and epidemiological studies
Environmental Microbiology
- Biodiversity assessment
- Ecosystem health monitoring
- Biogeochemical cycle studies
- Contamination source tracking
FAQs
What does “Unmapped species” mean in the reported profile?
“Unmapped species” represents the portion of the sample that the workflow could not match to the reference. More precisely, these are the sequenced reads that have no successful alignment (i.e. none passing the quality thresholds) against the database sequences.
What does “Unclassified species” mean in the reported profile?
“Unclassified species”, by contrast, represents reads assigned to species that are likely not present. This is a feature specific to Emu, which defines “unclassified mapped reads” as those that map only to database sequences of species presumed to be absent (likely because of their low overall abundance).