FastQC at OSC

class:inverse middle center

# Interpreting FastQC output

----

### Jelmer Poelstra, MCIC Wooster
### 2021/02/05 (updated: 2021-02-05)

---
class: inverse middle center
name: fastqc-output

# FastQC output: "module" by module

----

.left[
### Two useful sources of information: 
- ### [FastQC page for each module](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/) (Looks weird but works!)
- ### [MSU FastQC tutorial and FAQ](https://rtsf.natsci.msu.edu/sites/_rtsf/assets/File/FastQC_TutorialAndFAQ_080717.pdf)
]

---

## Summary and overview of modules

---

## Module 1: Basic statistics

---

## Module 2: Per-base quality along the read

- In a FASTQ file, every single base has a quality score.
  These figures visualize the mean per-base quality score along the length
  of the read.

.pull-left[
**Good / OK:**

<img src=img/fastqc_good.png width="100%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_bad.png width="100%">

]

.content-box-info[
- A decrease in sequence quality along the reads is normal.
- R2 (reverse) reads are usually worse than R1 (forward) reads.
]

---

## Module 3: Per-sequence quality scores

- Quality scores averaged over the full sequence.

.pull-left[
**Good:**

<img src=img/fastqc_mod04_per-seq-qual_good.png width="100%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_mod04_per-seq-qual_bad.png width="100%">

]

---

## Module 4: Per-base sequence content

.pull-left[
**Good:**

<img src=img/fastqc_mod05_per-base-content_good.png width="90%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_mod05_per-base-content_bad.png width="90%">

]

> It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module.
> &mdash; [source](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/4%20Per%20Base%20Sequence%20Content.html)

---

## Module 5: Per-sequence GC content

.pull-left[
**Good:**

<img src=img/fastqc_mod06_GC_good.png width="100%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_mod06_GC_bad.png width="100%">

]

.content-box-info[
- An unusual distribution could indicate contamination.

- The expected distribution is for whole-genome shotgun sequencing &ndash;
  it is normal for RNA-seq data to have a narrower distribution.
]

---

## Module 6: Per-base N content

- Quantifies the percentage of uncalled bases (N) across the read.

.pull-left[
**Good:**

<img src=img/fastqc_mod07_N_good.png width="90%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_mod07_N_bad.png width="90%">

]

.content-box-info[
- Ns may become more common at the end of the read,  
  and at the start of the read for highly biased libraries.
  
- A peak like in the fig. on the right indicates a problem with a specific
  cycle in the Illumina run.
]

---

## Module 7: Sequence length distribution

.center[
**Warning:**
]

.content-box-info[
Will throw a warning as soon as not all sequences are of the same length,
but this is quite normal.
]
  
---

## Module 8: Sequence duplication levels

- Checks, *for a subset of sequences*, how many duplicates
  (= identical sequences) are present.
  
.pull-left[
**Good:**

<figure>

<img src=img/fastqc_mod09_seqdup_good.png width="90%">
<figcaption><a href="https://rtsf.natsci.msu.edu/sites/_rtsf/assets/File/FastQC_TutorialAndFAQ_080717.pdf">Figure source</a> </figcaption>

</figure>
]

.pull-right[
**"Bad":**

]

.content-box-info[
- Often throws a warning for RNA-seq data, which can be ignored,
  as these represent highly expressed transcripts.
  
- Pay attention to the blue line (red line can mostly be ignored).
]

---

## Module 9: Overrepresented sequences

- Returns a *Warning* if any sequence is >0.1% of total.
- Returns *Failure* if any sequence is >1% of total.

---

## Module 10: Adapter content

- Checks for known adapter sequences

.pull-left[
**Good:**

<img src=img/fastqc_mod11_adapter_good.png width="100%">

]

.pull-right[
**Bad:**

<figure>

<img src=img/fastqc_mod11_adapter_bad.png width="100%">
<figcaption><a href="https://rtsf.natsci.msu.edu/sites/_rtsf/assets/File/FastQC_TutorialAndFAQ_080717.pdf">Figure source</a> </figcaption>

</figure>
]

.content-box-info[
When some of the insert sizes are shorter than the read length,
adapters can end up in the sequence &ndash; these should be removed!
]

---

## Module 11: K-mer content

- Another way to check for duplicated sequences,
  especially in the presence of sequencing error.

---

## Module 12: Per-tile sequence quality

### What is a tile?

---

## Module 12: Per-tile sequence quality (cont.)

.pull-left[
**Good:**

<img src=img/fastqc_mod03_per-tile_good.png width="100%">

]

.pull-right[
**Bad:**

<img src=img/fastqc_mod03_per-tile_bad.png width="100%">

]

---

## Let's run FastQC

- Go to your directory:

```sh
  $ cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER
  ```

- **Run the script:**

```sh
  $ scripts/QC_fastq/fastqc_dir.sh data/fastq results/QC_fastq
  ```

- Check the output:

```sh
  $ ls
  $ less slurm*
  
  $ ls -lh results/QC_fastq
  ```