Sequencing QC metrics

During sequencing, the sequencer determines the nucleotide bases in a DNA or RNA sample (library). For each fragment in the library, it generates a sequence, commonly known as a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument generates different types and amounts of errors, such as incorrectly called nucleotides. These miscalled bases are due to the technical limitations of each sequencing platform.

Therefore, it is necessary to understand and identify the error types that may affect the interpretation of downstream analysis. To perform quality control on sequencing reads, CosmosID-HUB Microbiome has integrated FastQC to generate a comprehensive QC report that can spot problems originating either in the sequencer or in the starting library material.

The analysis modules of FastQC(1) that are supported on the Hub are:

  1. Basic Statistics - The Basic Statistics module generates some simple composition statistics for the file analysed.

  2. Per Base Sequence Quality - The plot shows an overview of the range of quality values across all bases at each position in the FastQ file.

  3. Per Sequence Quality Scores - The per sequence quality score report allows you to see if a subset of your sequences has universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view, etc.). However, these should represent only a small percentage of the total sequences.

  4. Per Base Sequence Content - Per Base Sequence Content plots the proportion of each of the four normal DNA bases called at each base position in a file.

  5. Per Sequence GC Content - This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.

  6. Per Base N Content - If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This module plots out the percentage of base calls at each position for which an N was called.

  7. Sequence Length Distribution - Some high throughput sequencers generate sequence fragments of uniform length, but others can contain reads of wildly varying lengths. This module generates a graph showing the distribution of fragment sizes in the file which was analysed.

  8. Sequence Duplication Levels - In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).
    This module counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication.

  9. Adapter Content - The plot shows a cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position. Once a sequence has been seen in a read it is counted as being present right through to the end of the read so the percentages you see will only increase as the read length goes on.
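
Several of the per-position and per-read metrics listed above can be illustrated with a minimal Python sketch. This is not FastQC's implementation and it is not code that runs on the Hub. The function names and the example reads are hypothetical, and the sketch assumes four-line FASTQ records with Phred+33 quality encoding, where a quality score Q corresponds to an estimated error probability of 10^(-Q/10). FastQC itself applies additional sampling, binning and modelling that are omitted here.

```python
from collections import Counter
from statistics import mean

# Hypothetical example data: three short reads in four-line FASTQ format.
EXAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
ACGN
+
II5!
@read3
ACGT
+
IIII
"""

def parse_fastq(text):
    """Yield (sequence, quality_string) pairs, assuming 4-line records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i + 1], lines[i + 3]

def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (Phred+33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]

def per_base_mean_quality(records):
    """Mean Phred score at each read position (cf. Per Base Sequence Quality)."""
    by_pos = {}
    for _, qual in records:
        for pos, q in enumerate(phred_scores(qual)):
            by_pos.setdefault(pos, []).append(q)
    return [mean(by_pos[p]) for p in sorted(by_pos)]

def per_base_n_percent(records):
    """Percentage of N calls at each read position (cf. Per Base N Content)."""
    n_counts, totals = Counter(), Counter()
    for seq, _ in records:
        for pos, base in enumerate(seq):
            totals[pos] += 1
            n_counts[pos] += base.upper() == "N"
    return [100.0 * n_counts[p] / totals[p] for p in sorted(totals)]

def gc_percent(seq):
    """GC percentage over the whole read (cf. Per Sequence GC Content)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def duplication_levels(records):
    """Map 'times seen' to the number of distinct sequences seen that often."""
    counts = Counter(seq for seq, _ in records)
    return Counter(counts.values())

records = list(parse_fastq(EXAMPLE_FASTQ))
print(per_base_mean_quality(records))
print(per_base_n_percent(records))
print([gc_percent(seq) for seq, _ in records])
print(dict(duplication_levels(records)))
```

In this toy input, the last position of read2 carries an N with the lowest possible quality score, so the mean quality drops and the N percentage rises at that position, which is the kind of pattern the corresponding FastQC plots are meant to expose.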

Where can I view the FastQC results on the Hub?

The FastQC results are available from the results dropdown menu on the single sample explorer page.

How can I aggregate the sequencing QC results of multiple samples on the Hub?

The Comparative Analysis on CosmosID-HUB Microbiome allows users to aggregate the sequencing QC metrics using MultiQC(2) and explore the trends visually across all the samples of interest.
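
For intuition about what aggregating QC results across samples involves, the following is a minimal Python sketch that collects the pass/warn/fail status of each module from several FastQC reports into one table. It is not MultiQC and not the Hub's implementation. It assumes the tab-separated `fastqc_data.txt` layout written by FastQC, in which each module starts with a `>>Module name<TAB>status` line and ends with `>>END_MODULE`. The sample names and report contents below are hypothetical.

```python
def module_statuses(fastqc_data_text):
    """Return {module name: status} parsed from the text of a fastqc_data.txt file."""
    statuses = {}
    for line in fastqc_data_text.splitlines():
        # Module headers start with '>>'; '>>END_MODULE' closes a module.
        if line.startswith(">>") and not line.startswith(">>END_MODULE"):
            name, _, status = line[2:].partition("\t")
            statuses[name] = status.strip()
    return statuses

def aggregate(reports):
    """Turn {sample: report text} into {module: {sample: status}}."""
    table = {}
    for sample, text in reports.items():
        for module, status in module_statuses(text).items():
            table.setdefault(module, {})[sample] = status
    return table

# Hypothetical, heavily truncated reports for two samples.
REPORTS = {
    "sample_a": "##FastQC\t0.11.9\n>>Basic Statistics\tpass\n#Measure\tValue\n"
                "Total Sequences\t1000\n>>END_MODULE\n"
                ">>Adapter Content\twarn\n>>END_MODULE\n",
    "sample_b": "##FastQC\t0.11.9\n>>Basic Statistics\tpass\n>>END_MODULE\n"
                ">>Adapter Content\tfail\n>>END_MODULE\n",
}

for module, per_sample in aggregate(REPORTS).items():
    print(module, per_sample)
```

Arranging the results by module rather than by sample makes it easier to see which checks behave differently across the samples of interest.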

References

  1. Babraham Institute. May 12, 2022. FastQC version 0.11.9. Source code, https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32(19):3047-3048. doi:10.1093/bioinformatics/btw354