
Interpreting FastQC output





Jelmer Poelstra, MCIC Wooster

2021/02/05 (updated: 2021-02-05)

1 / 17

FastQC output: "module" by module



Two useful sources of information:



2 / 17

Summary and overview of modules

3 / 17

Module 1: Basic statistics

4 / 17

Module 2: Per-base quality along the read

  • In a FASTQ file, every single base has a quality score. These figures visualize the mean per-base quality score along the length of the read.

Good / OK:

Bad:

  • A decrease in sequence quality along the reads is normal.
  • R2 (reverse) reads are usually of lower quality than R1 (forward) reads.
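
To make the idea of a per-position mean quality concrete, here is a minimal sketch that computes it directly from a FASTQ file. It is not how FastQC itself is implemented; the function name `mean_qual_per_pos` is made up for illustration, and the sketch assumes 4-line FASTQ records with Phred+33 encoded qualities.

```shell
# Mean Phred quality per read position (hypothetical helper).
# Assumes 4-line FASTQ records and Phred+33 encoding.
# Reads the file given as $1, or standard input if no file is given.
mean_qual_per_pos() {
    awk '
    BEGIN {
        # Lookup table from printable ASCII character to its code point.
        for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 0 {                      # every 4th line holds the quality string
        n = length($0)
        if (n > maxlen) maxlen = n
        for (p = 1; p <= n; p++) {
            sum[p] += ord[substr($0, p, 1)] - 33
            cnt[p]++
        }
    }
    END {
        # One line per position: position <TAB> mean quality.
        for (p = 1; p <= maxlen; p++)
            printf "%d\t%.1f\n", p, sum[p] / cnt[p]
    }' "${1:--}"
}

# Usage with a gzipped file (file name is a placeholder):
#   gzip -dc sample_R1.fastq.gz | mean_qual_per_pos
```

A steady decline in the second column toward the end of the read corresponds to the normal pattern described above.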
5 / 17

Module 3: Per-sequence quality scores

  • Quality scores averaged over the full sequence.
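
As a rough illustration of what this module summarizes, the sketch below averages the quality scores within each read and counts how many reads fall at each (truncated) mean score. The helper name `mean_qual_hist` is hypothetical, and the sketch assumes 4-line FASTQ records with Phred+33 encoding; FastQC's own binning may differ.

```shell
# Histogram of per-read mean quality (hypothetical helper).
# Assumes 4-line FASTQ records and Phred+33 encoding.
mean_qual_hist() {
    awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 && length($0) > 0 {
        total = 0
        n = length($0)
        for (p = 1; p <= n; p++) total += ord[substr($0, p, 1)] - 33
        hist[int(total / n)]++         # truncate the mean to a whole Phred score
    }
    END {
        # One line per observed score: mean quality <TAB> number of reads.
        for (q = 0; q <= 93; q++)
            if (q in hist) printf "%d\t%d\n", q, hist[q]
    }' "${1:--}"
}

# Usage (file name is a placeholder):
#   gzip -dc sample_R1.fastq.gz | mean_qual_hist
```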

Good:

Bad:

6 / 17

Module 4: Per-base sequence content

Good:

Bad:

It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module. — source

7 / 17

Module 5: Per-sequence GC content

Good:

Bad:

  • An unusual distribution could indicate contamination.

  • The expected (theoretical) distribution assumes whole-genome shotgun sequencing; it is normal for RNA-seq data to have a narrower distribution.
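
The quantity plotted in this module can be sketched as follows: the GC percentage of each read, one value per line. This is only an illustration under the assumption of 4-line FASTQ records; `gc_per_read` is a made-up helper name, not part of FastQC.

```shell
# GC percentage of every read (hypothetical helper).
# Assumes 4-line FASTQ records; reads $1 or standard input.
gc_per_read() {
    awk 'NR % 4 == 2 && length($0) > 0 {
        seq = toupper($0)
        gc = gsub(/[GC]/, "", seq)     # gsub returns the number of replacements
        printf "%.1f\n", 100 * gc / length($0)
    }' "${1:--}"
}

# Usage (file name is a placeholder): count reads per rounded GC value
#   gzip -dc sample_R1.fastq.gz | gc_per_read | cut -d. -f1 | sort -n | uniq -c
```

A second peak in the resulting counts, away from the main one, would be the kind of unusual distribution that could point to contamination.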

8 / 17

Module 6: Per-base N content

  • Quantifies the percentage of uncalled bases (N) across the read.

Good:

Bad:

  • Ns may become more common at the end of the read,
    and at the start of the read for highly biased libraries.

  • A peak like the one in the figure on the right indicates a problem with a specific cycle in the Illumina run.
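
A minimal sketch of the underlying calculation, assuming 4-line FASTQ records: for each position along the read, the percentage of reads with an `N` at that position. The helper name `n_content_per_pos` is hypothetical.

```shell
# Percentage of uncalled bases (N) per read position (hypothetical helper).
# Assumes 4-line FASTQ records; reads $1 or standard input.
n_content_per_pos() {
    awk 'NR % 4 == 2 {
        n = length($0)
        if (n > maxlen) maxlen = n
        for (p = 1; p <= n; p++) {
            cnt[p]++
            if (toupper(substr($0, p, 1)) == "N") ncount[p]++
        }
    }
    END {
        # One line per position: position <TAB> percent N.
        for (p = 1; p <= maxlen; p++)
            printf "%d\t%.1f\n", p, 100 * ncount[p] / cnt[p]
    }' "${1:--}"
}
```

A single position with a much higher percentage than its neighbors corresponds to the cycle-specific peak described above.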

9 / 17

Module 7: Sequence length distribution

Warning:

FastQC throws a warning as soon as the sequences are not all the same length, but this is quite normal.
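
To see the length distribution yourself, something like the following sketch would work. It assumes 4-line FASTQ records, and `read_length_dist` is a made-up helper name.

```shell
# Number of reads per read length (hypothetical helper).
# Assumes 4-line FASTQ records; reads $1 or standard input.
read_length_dist() {
    awk 'NR % 4 == 2 { print length($0) }' "${1:--}" |
        sort -n |
        uniq -c |
        awk '{ print $2 "\t" $1 }'     # output: length <TAB> number of reads
}
```

More than one line of output means the reads differ in length, which is the condition that triggers the warning.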

10 / 17

Module 8: Sequence duplication levels

  • Checks, for a subset of sequences, how many duplicates (= identical sequences) are present.

"Bad":

  • Often throws a warning for RNA-seq data. This can usually be ignored, because the duplicates come from highly expressed transcripts.

  • Pay attention to the blue line (red line can mostly be ignored).
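
The notion of duplicates as identical sequences can be sketched as below. Unlike FastQC, which only looks at a subset of sequences, this illustration counts every read, so its numbers will not match the report exactly. It assumes 4-line FASTQ records, and `dedup_summary` is a hypothetical helper name.

```shell
# Total reads, distinct sequences, and percent remaining after deduplication
# (hypothetical helper). Assumes 4-line FASTQ records; reads $1 or stdin.
dedup_summary() {
    awk 'NR % 4 == 2 {
        total++
        if (!seen[$0]++) unique++      # first time this exact sequence is seen
    }
    END {
        if (total)
            printf "total=%d unique=%d pct_remaining=%.1f\n",
                   total, unique, 100 * unique / total
    }' "${1:--}"
}
```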

11 / 17

Module 9: Overrepresented sequences

  • Returns a Warning if any single sequence makes up more than 0.1% of the total.
  • Returns a Failure if any single sequence makes up more than 1% of the total.
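
The two thresholds above can be applied by hand with a sketch like this one. It counts every read in the file, which may differ from how FastQC samples its input, and it assumes 4-line FASTQ records; `overrep_status` is a made-up helper name.

```shell
# Flag sequences above the 0.1% (WARN) and 1% (FAIL) thresholds
# (hypothetical helper). Assumes 4-line FASTQ records; reads $1 or stdin.
overrep_status() {
    awk 'NR % 4 == 2 { total++; count[$0]++ }
    END {
        for (s in count) {
            pct = 100 * count[s] / total
            if (pct > 1) status = "FAIL"
            else if (pct > 0.1) status = "WARN"
            else continue              # below both thresholds: not reported
            printf "%s\t%s\t%.2f\n", status, s, pct
        }
    }' "${1:--}" | sort -k3,3nr        # most abundant sequence first
}
```

Each output line gives the status, the sequence, and its percentage of all reads.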
12 / 17

Module 10: Adapter content

  • Checks for known adapter sequences.

Good:

When some of the insert sizes are shorter than the read length, adapters can end up in the sequence – these should be removed!
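
A quick way to check for adapter read-through by hand is to count the reads that contain an adapter sequence, as in the sketch below. The default adapter string used here (the start of a common Illumina adapter) is an assumption, so substitute the adapter for your own library kit; `adapter_fraction` is a hypothetical helper name, and 4-line FASTQ records are assumed.

```shell
# Fraction of reads containing an adapter sequence (hypothetical helper).
#   $1 = FASTQ file ("-" or omitted for standard input)
#   $2 = adapter sequence; the default is an assumption, adjust for your kit
adapter_fraction() {
    adapter=${2:-AGATCGGAAGAGC}
    awk -v adapter="$adapter" 'NR % 4 == 2 {
        total++
        if (index($0, adapter)) hits++     # exact substring match only
    }
    END {
        if (total)
            printf "%d/%d reads (%.1f%%) contain %s\n",
                   hits, total, 100 * hits / total, adapter
    }' "${1:--}"
}
```

Because this only finds exact, complete matches, it will miss partial adapters at the very end of a read, so treat the result as a lower bound.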

13 / 17

Module 11: K-mer content

  • Another way to check for duplicated sequences, especially in the presence of sequencing error.
14 / 17

Module 12: Per-tile sequence quality

What is a tile?

15 / 17

Module 12: Per-tile sequence quality (cont.)

Good:

Bad:

16 / 17

Let's run FastQC

  • Go to your directory:

    $ cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER
  • Run the script:

    $ scripts/QC_fastq/fastqc_dir.sh data/fastq results/QC_fastq
  • Check the output:

    $ ls
    $ less slurm*
    $ ls -lh results/QC_fastq
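
Besides the HTML report, each FastQC `.zip` output contains a tab-separated `summary.txt` with one PASS/WARN/FAIL line per module. A minimal sketch for listing only the flagged modules is shown below; `fastqc_flagged` is a made-up helper name and the sample file name in the usage comment is a placeholder.

```shell
# Print the modules that FastQC flagged as WARN or FAIL (hypothetical helper).
# Input: one or more summary.txt files, or standard input.
# Columns in summary.txt: status <TAB> module name <TAB> file name.
fastqc_flagged() {
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $1 "\t" $2 }' "$@"
}

# Usage (sample name is a placeholder):
#   unzip -p results/QC_fastq/sample_fastqc.zip '*/summary.txt' | fastqc_flagged
```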
17 / 17
