RNAseq read alignment with STAR

class:inverse middle center

# RNAseq read alignment with STAR

----

### Jelmer Poelstra, MCIC Wooster
### 2021/02/26 (updated: 2021-03-04)
  
---

## STAR options for alignment

Some options for aligning reads to the genome using STAR are  
(_not showing options that were shown above for indexing_):

| Option | Meaning
|---------------------------|------------
| `--readFilesIn` | Path to FASTQ file
| `--outFileNamePrefix` | Path and (**sample**) prefix for output files
| `--outFilterMultimapNmax` | Max. nr. alignments for a read (default: 10)
| `--outSAMtype` | Output sorted BAM (default: SAM) `BAM SortedByCoordinate` for sorted BAM
| `--outSAMunmapped` | What to do with unmapped reads (default: don't output)
| `--alignIntronMin 5` | Minimum intron length* (default: 20)
|`--alignIntronMax 350000` | Maximum intron length* (default: 1 M)

* From [Clark et al. 2019](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6546887/):
> [Tomato] intron sizes varied from 5 to 313,176 bp with a mean value of 1,352 bp.

---

## The key parts of a script to map reads for 1 sample

```sh
#!/bin/bash
#SBATCH --account=PAS0471
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=18
module load star/2.6.0a

R1=$(ls "$fastq_dir"/*"$sample"*R1*fastq.gz)
R2=$(ls "$fastq_dir"/*"$sample"*R2*fastq.gz)

STAR --runThreadN "$SLURM_CPUS_ON_NODE" \
   --genomeDir "$index_dir" \
   --sjdbGTFtagExonParentTranscript "$ref_gff" \
   --readFilesIn "$R1" "$R2" \
   --readFilesCommand zcat \
   --outFileNamePrefix "$bam_dir/$sample"_ \
   --outSAMtype BAM SortedByCoordinate \
   --alignIntronMin 5 --alignIntronMax 350000
```

---

## Submit the script for one sample

We run the script for an arbitrary sample to see if it works:

```sh
# You should be here:
# /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER/

sample=C6_myb2

fastq_dir=data/fastq_concat_subsample 
index_dir=data/ref/Heinz1706_4.00/STAR_index
ref_gff=data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
bam_dir=results/align/bam

sbatch scripts/align/star_align.sh \
  "$sample" "$fastq_dir" "$index_dir" "$ref_gff" "$bam_dir"
```

---

## Check the output

First, let's have a look at the SLURM log of our script:

```sh
$ cat slurm-align-C6_myb2-12836540.out
#> Starting script star_align.sh
#> Mon Feb 15 20:47:51 EST 2021
#> -------------------
#> 
#> R1 input file: data/fastq_concat_subsample/C6_myb2_R1.fastq.gz
#> R2 input file: data/fastq_concat_subsample/C6_myb2_R2.fastq.gz
#> Genome index dir: data/ref/Heinz1706_4.00/STAR_index
#> Genome annotation file: data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
#> Bam (output) directory: results/align/bam
#> 
#> Running STAR....
#> Feb 15 20:47:51 ..... started STAR run
#> Feb 15 20:47:52 ..... loading genome
#> Feb 15 20:47:59 ..... started mapping
#> Feb 15 20:50:11 ..... started sorting BAM
#> Feb 15 20:50:19 ..... finished successfully
#> 
#> ----------------
#> Done with script star_align.sh
#> Mon Feb 15 20:50:19 EST 2021
```

---

## Check the output (cont.)

Next, let's check if we can see the BAM file and what its size is:

```sh
$ du -sh results/align/bam/*bam
#> 230M    results/align/bam/C6_myb2_Aligned.sortedByCoord.out.bam
```

We'll take a closer look at the STAR log files, which our script moved
to a sub-directory of the output dir:

```sh
$ ls results/align/bam/star_logs
#> results/align/bam/star_logs/C6_myb2_Log.final.out
#> results/align/bam/star_logs/C6_myb2_Log.out
#> results/align/bam/star_logs/C6_myb2_Log.progress.out
#> results/align/bam/star_logs/C6_myb2_SJ.out.tab
```

---

## Check the output (cont.)

The file `<sample>_Log` has a basic run log:

```sh
$ less results/align/bam/star_logs/C6_myb2_Log.out
# STAR version=STAR_2.6.0a
# STAR compilation time,server,dir=Mon Apr 23 13:19:26 EDT 2018 # florence.cshl.edu:/sonas-hs/gingeras/nlsas_norepl/user/dobin/STAR/STAR.master/source
# ##### DEFAULT parameters:
# versionSTAR                       20201
# versionGenome                     20101   20200   
# parametersFiles                   -   
# sysShell                          -
# runMode                           alignReads
# 
# ...
# 
# Loading Genome ... done! state: good=1 eof=0 fail=0 bad=0; loaded 784072704 bytes
# SA file size: 6455421245 bytes; state: good=1 eof=0 fail=0 bad=0
# Loading SA ... done! state: good=1 eof=0 fail=0 bad=0; loaded 6455421245 bytes
# Loading SAindex ... done: 391468491 bytes
# Finished loading the genome: Mon Feb 15 20:47:59 2021
# 
# alignIntronMax=alignMatesGapMax=0, the max intron size will be approximately determined by # (2^winBinNbits)*winAnchorDistNbins=589824
# Created thread # 1
# Created thread # 2
```

---

## Check the output (cont.)

The file `<sample>_Log.progress.out` has a log of mapping progress.

The mapping of this file apparently went so fast that only one row is shown,
but this file can show multiple rows with mappping stats as time progresses.

```sh
$ cat results/align/bam/star_logs/C6_myb2_Log.progress.out

#           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
#                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
# Feb 15 20:49:27     48.8     1192672      276    76.3%    277.8     0.2%     1.6%     0.1%     0.3%    19.1%     2.6%
# ALL DONE!
```

---

## Check the output (cont.)

The file `<sample>_SJ.out.tab` has a table of high-confidence splice junctions
(only those supported by *uniquely* mapping reads):

```sh
$ less results/align/bam/star_logs/C6_myb2_SJ.out.tab
# SL4.0ch00       18299   19023   1       1       0       1       0       12
# SL4.0ch00       1253757 1264285 2       2       0       0       1       46
# SL4.0ch00       1259569 1263678 2       2       0       1       0       47
# SL4.0ch00       1259569 1264285 2       2       0       1       0       26
# SL4.0ch00       1263726 1264285 2       2       0       2       1       47
# SL4.0ch00       1376108 1376905 1       5       0       0       1       33
# SL4.0ch00       1386429 1387446 1       1       0       0       1       55
# SL4.0ch00       1389552 1390217 0       0       0       0       18      66
# SL4.0ch00       1457286 1457951 0       0       0       0       18      65
# SL4.0ch00       1470754 1471439 1       5       0       0       1       64
# SL4.0ch00       1531568 1532365 1       5       0       0       1       33
# SL4.0ch00       1541889 1542906 1       1       0       0       1       55
# SL4.0ch00       1545012 1545677 0       0       0       0       18      66
# SL4.0ch00       1903425 1905479 2       2       0       0       1       53
# SL4.0ch00       1903425 1946574 2       2       0       0       1       53
# SL4.0ch00       2044995 2046903 0       0       0       0       6       32
# SL4.0ch00       2497375 2498040 0       0       0       0       18      65
# SL4.0ch00       2510840 2511525 1       5       0       0       1       64
# SL4.0ch00       2574688 2575704 2       2       0       0       1       55
```

---

## Check the output (cont.)

Most useful for now is the file `<sample>_Log.final.out`,
which has a summary of mapping statistics:

---

## Check the output (cont.)

While 76% of reads mapped uniquely (previous slide),
among the rest, we can see that "multi-mapped reads" account for a relatively
small proportion (<2%) while there are a lot (>20%) of unmapped reads.

Among unmapped reads, the vast majority (19% of the total number of reads)
are "too short". This does not necessarily mean that the read itself was too
short. More often, it will mean that the alignment that could be made was
too short.

```sh
# [...continuing results/align/bam/star_logs/C6_myb2_Log.final.out]
# MULTI-MAPPING READS:
# Number of reads mapped to multiple loci   | 42055
# % of reads mapped to multiple loci        | 1.65%
# Number of reads mapped to too many loci   | 1977
# % of reads mapped to too many loci        | 0.08%
# UNMAPPED READS:
# % of reads unmapped: too many mismatches  | 0.31%
# % of reads unmapped: too short            | 19.12%
# % of reads unmapped: other                | 2.55%
# CHIMERIC READS:
# Number of chimeric reads                  | 0
# % of chimeric reads                       | 0.00%
```

---

## Splice site frequencies in plants

In the log file, we saw the numbers of different observed splice sites.
How do those compare to our expectations?

---

```sh
# Number of splices: GT/AG                  | 1451397
# Number of splices: GC/AG                  | 14313
# Number of splices: AT/AC                  | 747
# Number of splices: Non-canonical          | 21325
```

<figure>

<img src=img/splice-sites_PuckerBrockington2018.png width="90%">
<figcaption><a href="https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5360-z">Pucker & Brockington 2018</a> </figcaption>

</figure>

---

## A script to loop over all files

```sh
#!/bin/bash

samples=($(cat "$sample_list"))

for sample in "${samples[@]}"; do

R1=$(ls "$fastq_dir"/*"$sample"*_R1*fastq.gz)

slurm_log=slurm-align-"$sample"-%j.out

echo "Submitting star_align.sh for $sample"
  sbatch -o "$slurm_log" scripts/align/star_align.sh \
      "$sample" "$fastq_dir" "$index_dir" "$ref_gff" "$bam_dir"

done
```

---

## Submit the loop script

```sh
# You should still have the other variables ($fastq_dir etc) assigned
$ sample_list=data/meta/samples.txt

$ bash scripts/align/star_align_loop.sh
      "$sample_list" "$fastq_dir" "$index_dir" "$ref_gff" "$bam_dir"
    
#> Starting script star_align_loop.sh
#> Wed Feb 10 21:03:13 EST 2021
#> -------------------
#> 
#> FASTQ dir: data/fastq_concat
#> Sample list: data/meta/samples.txt
#> Genome index dir: data/ref/Heinz1706_4.00/STAR_index
#> Genome annotation file: data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
#> Bam (output) directory: results/align/bam
#> 
#> Submitting star_align.sh for C6_myb2
#> Submitted batch job 12830287
#> Submitting star_align.sh for C6_myb3
#> Submitted batch job 12830291
#> ...
```

---

## Check the output (cont.)

```sh
$ du -sh $bam_dir/*bam
#> 230M    results/align/bam/C6_myb2_Aligned.sortedByCoord.out.bam
#> 270M    results/align/bam/C6_myb3_Aligned.sortedByCoord.out.bam
#> 183M    results/align/bam/C6_myb4_Aligned.sortedByCoord.out.bam
#> 237M    results/align/bam/cnt_1_Aligned.sortedByCoord.out.bam
#> 240M    results/align/bam/cnt_2_Aligned.sortedByCoord.out.bam
#> 211M    results/align/bam/cnt_3_Aligned.sortedByCoord.out.bam
#> 274M    results/align/bam/CT_05_Aligned.sortedByCoord.out.bam
#> 254M    results/align/bam/CT_06_Aligned.sortedByCoord.out.bam
#> 286M    results/align/bam/CT_09_Aligned.sortedByCoord.out.bam
#> 275M    results/align/bam/myb_1_Aligned.sortedByCoord.out.bam
#> 285M    results/align/bam/myb_2_Aligned.sortedByCoord.out.bam
#> 292M    results/align/bam/myb_3_Aligned.sortedByCoord.out.bam
#> 251M    results/align/bam/T_26_Aligned.sortedByCoord.out.bam
#> 284M    results/align/bam/T_29_Aligned.sortedByCoord.out.bam
#> 0       results/align/bam/T_30_Aligned.sortedByCoord.out.bam
```

---

## Check the output (cont.)

```sh
$ grep "Uniquely mapped reads %" results/align/bam/star_logs/*Log.final*
#> C6_myb2_Log.final.out: Uniquely mapped reads % |   76.30%
#> C6_myb3_Log.final.out: Uniquely mapped reads % |   79.46%
#> C6_myb4_Log.final.out: Uniquely mapped reads % |   56.44%
#> cnt_1_Log.final.out:   Uniquely mapped reads % |   71.11%
#> cnt_2_Log.final.out:   Uniquely mapped reads % |   78.40%
#> cnt_3_Log.final.out:   Uniquely mapped reads % |   66.60%
#> CT_05_Log.final.out:   Uniquely mapped reads % |   80.29%
#> CT_06_Log.final.out:   Uniquely mapped reads % |   81.05%
#> CT_09_Log.final.out:   Uniquely mapped reads % |   86.46%
#> myb_1_Log.final.out:   Uniquely mapped reads % |   87.14%
#> myb_2_Log.final.out:   Uniquely mapped reads % |   86.76%
#> myb_3_Log.final.out:   Uniquely mapped reads % |   87.18%
#> T_26_Log.final.out:    Uniquely mapped reads % |   85.01%
#> T_29_Log.final.out:    Uniquely mapped reads % |   86.01%
```

---

## Running MultiQC on the STAR logs

```sh
$ sbatch scripts/qc/multiqc.sh results/align/bam/star_logs
```

### Check MultiQC output

---

## Other options for QC and viewing of alignments

- **Qualimap**
  ([website](http://qualimap.conesalab.org/doc_html/index.html),
  [paper](https://pubmed.ncbi.nlm.nih.gov/26428292/))
  is an alignment QC tool, which can, for instance:
  - Compute statistics on the proportions of reads mapping to different genomic
    feature types such as exons, introns, and intergenic regions.
  - Quantify 3' and 5' bias.
  - Analyze splice junctions.
  
- Another option for extensive alignment QC is **RseQC**
  ([website](http://rseqc.sourceforge.net/),
  [paper](https://academic.oup.com/bioinformatics/article/28/16/2184/325191)).

---

## Other options for QC and viewing of alignments

Integrative Genome Viewer (**IGV**):

<figure>

<img src=img/igv.png width="100%">
<figcaption><a href="https://github.com/griffithlab/rnabio.org/blob/master/assets/lectures/cshl/2020/mini/RNASeq_MiniLecture_02_04_alignmentQC.pdf">Source</a> </figcaption>

</figure>