+ - 0:00:00
Notes for current slide
Notes for next slide

Tabulating reads with featureCounts





Jelmer Poelstra, MCIC Wooster

2021/03/12 (updated: 2021-03-12)

1 / 21

Overview of the analysis pipeline

2 / 21

Overview so far: files

3 / 21

Counting genes (and other genomic features)

  • Now that we have our alignments, we need to count how many reads align to each gene for each sample.

  • A genomic "feature" is an annotated elements in the genome such as a gene, exon, intron, or UTR.

    Therefore, counting "features" is the more general name for this, since we could just as well be counting exons.

4 / 21

Counting genes (and other genomic features)

  • Now that we have our alignments, we need to count how many reads align to each gene for each sample.

  • A genomic "feature" is an annotated elements in the genome such as a gene, exon, intron, or UTR.

    Therefore, counting "features" is the more general name for this, since we could just as well be counting exons.




Counting transcripts (i.e. distinguishing the products of alternative splicing within a gene) is a lot trickier than gene- or exon-counting, since transcripts will share a lot of features.

We will not count transcripts now, and the software we will use is also not designed for that.

4 / 21

Features are referenced using GFF annotation files

While genome sequences are stored in FASTA files, annotation information is stored in files like GFF (or GTF).

  • These files contain coordinates for genomic features (genes, transcripts, exons, etc.) – one per row.

  • They also contain lots of additional information for each feature.

Let's have a look at the GFF file we are working with.

5 / 21

GFF file for the tomato ITAG4.1 annotation

  • First, we have a header, whose lines start with a # (comment), and which has information about the different chromosomes/scaffold:
less -S data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
##gff-version 3
##sequence-regionSL4.0ch00 1 9643250
##sequence-regionSL4.0ch01 1 90863682
##sequence-regionSL4.0ch02 1 53473368
##sequence-regionSL4.0ch03 1 65298490
##sequence-regionSL4.0ch04 1 64459972
##sequence-regionSL4.0ch05 1 65269487
##sequence-regionSL4.0ch06 1 47258699
##sequence-regionSL4.0ch07 1 67883646
##sequence-regionSL4.0ch08 1 63995357
##sequence-regionSL4.0ch09 1 68513564
##sequence-regionSL4.0ch10 1 64792705
##sequence-regionSL4.0ch11 1 54379777
##sequence-regionSL4.0ch12 1 66688036
6 / 21

GFF file for the tomato ITAG4.1 annotation (cont.)

The main part of the file is tabular:

less -S data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
# chrom source feature start end score strand frame attribute
#> SL4.0ch00 maker gene 83863 84177 . + . ID=gene:Solyc00g160260.1;Name=Solyc00g160260.1
#> SL4.0ch00 maker mRNA 83863 84177 . + . #> ID=mRNA:Solyc00g160260.1.1;Parent=gene:Solyc00g160260.1;Name=Solyc00g160260.1.1;_aed=0.30;_eaed=0.59;_qi=0|0|0|1|0|0|2#> |0|100;Note=Homeobox leucine-zipper protein (AHRD V3.11 *-* tr|Q8H963|Q8H963_ZINVI)
#> SL4.0ch00 maker exon 83863 84043 . + . #> ID=exon:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
#> SL4.0ch00 maker CDS 83863 84043 . + 0 #> ID=CDS:Solyc00g160260.1.1.1;Parent=mRNA:Solyc00g160260.1.1
#> SL4.0ch00 maker exon 84056 84177 . + . #> ID=exon:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1
#> SL4.0ch00 maker CDS 84056 84177 . + 2 #> ID=CDS:Solyc00g160260.1.1.2;Parent=mRNA:Solyc00g160260.1.1
#> ###
#> SL4.0ch00 maker gene 166754 167268 . - . ID=gene:Solyc00g160270.1;Name=Solyc00g160270.1
#> SL4.0ch00 maker mRNA 166754 167268 . - . #> ID=mRNA:Solyc00g160270.1.1;Parent=gene:Solyc00g160270.1;Name=Solyc00g160270.1.1;_aed=0.25;_eaed=0.44;_qi=0|0|0|1|0|0|2#> |0|166;Note=Glucuronoxylan 4-O-methyltransferase-like protein (DUF579) (AHRD V3.11 *-
#> SL4.0ch00 maker exon 166754 167102 . - . #> ID=exon:Solyc00g160270.1.1.1;Parent=mRNA:Solyc00g160270.1.1
#> SL4.0ch00 maker CDS 166754 167102 . - 1 #> ID=CDS:Solyc00g160270.1.1.1;Parent=mRNA:Solyc00g160270.1.1
#> SL4.0ch00 maker exon 167117 167268 . - . #> ID=exon:Solyc00g160270.1.1.2;Parent=mRNA:Solyc00g160270.1.1
#> SL4.0ch00 maker CDS 167117 167268 . - 0 ID=CDS:Solyc00g160270.1.1.2;Parent=mRNA:Solyc00g160270.1.1
#> ###
7 / 21

Counting features

Figure from hbctraining.github.io
8 / 21

The result: a table with counts per gene per sample

Figure from hbctraining.github.io
9 / 21

featureCounts

As always, there are several programs available to do this, but we will use featureCounts, a very widely used tool that is part of the Subread package.

  • As mentioned, we will be counting a the gene level.

  • featureCounts will produce raw counts.

    That is, a count of 10 for gene_x and sample_a literally means that in the BAM file for sample_a, 10 qualifying reads (or read pairs) mapped to gene_x.

  • We will do some normalization prior to differential expression (DE) analysis, but don't need to transform data to relative counts like FPKM or TPM.

    This is because we are interested in comparing expression levels between samples but within genes.

10 / 21

How featureCounts counts

Figure from hbctraining.github.io
11 / 21

Creating a Conda environment with Subread

At OSC, run the following to create a Conda environment with the Subread package installed:

module load python/3.6-conda5.2
conda create -n subread-env -c bioconda subread

Check whether it worked:

source activate subread-env
featureCounts --help



From now on, to load the Conda module to run featureCounts:

module load python/3.6-conda5.2
source activate subread-env
12 / 21

Running featureCounts

We will use featureCounts with the following options:

featureCounts \
-s 2 \ # Reverse-stranded library
-p \ # Count fragments, not reads (paired-end)
-B \ # Require both of pair to be aligned
-C \ # Don't count pairs with discordant mates
-t gene \ # Feature type to count (e.g. exon or gene)
-g Name \ # Name of feature type in annotation file
-T 12 \ # Number of threads
-a "$gff_file" \ # Annotation file
-o "$output_file" \ # Main output file (count matrix)
"$bam_dir"/*bam # Any number of BAM files
13 / 21

Some other options and defaults

Option Details
-0 Assign reads that overlap with multiple features
(default: not assigned)
-M Include multi-mapping reads (default: not included)
--minOverlap Minimum number of overlapping bases required for
read assignment (default: 1)
-s Strand-specific read counting:
0=unstranded, 1=stranded, 2=reversely stranded


Also of note, featureCounts will:

Automatically sort paired-end reads. Users can provide either location-sorted or namesorted bams files to featureCounts. Read sorting is implemented on the fly and it only incurs minimal time cost.

14 / 21

Key parts of a script to run featureCounts

#!/bin/bash
#SBATCH --account=PAS0471
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=12
#SBATCH --out=slurm-featureCounts-%j.out
set -e -u -o pipefail
module load python/3.6-conda5.2
source activate subread-env
bam_dir=$1
output_dir=$2
gff_file=$3
mkdir -p "$output_dir"
output_file="$output_dir"/counts.txt
featureCounts -p -B -C -t gene -g Name \
-T "$SLURM_CPUS_ON_NODE" \
-a "$gff_file" \
-o "$output_file" \
"$bam_dir"/*bam
15 / 21

Prepare to run the featureCounts script

We will first clean up the SLURM log files a bit:

mkdir logs # Create a dir for the log files
mv slurm-* logs/ # Move all SLURM logs


Next, we have to remove the empty BAM file for the sample for which alignment failed – otherwise, featureCounts will complain:

rm results/align/bam/T_30_Aligned.sortedByCoord.out.bam
16 / 21

Run the featureCounts script

cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER/
bam_dir=results/align/bam
output_dir=results/count
gff_file=data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff
sbatch scripts/count/featureCounts.sh $bam_dir $output_dir $gff_file
17 / 21

featureCounts output: .summary file

less results/count/counts.txt.summary
# Status results/align/bam/C6_myb2_Aligned.sortedByCoord.out.bam # results/align/bam/C6_myb3_Aligned.sortedByCoord.out.bam # results/align/bam/C6_myb4_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_1_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_2_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_3_Aligned.sortedByCoord.out.bam # results/align/bam/CT_05_Aligned.sortedByCoord.out.bam # results/align/bam/CT_06_Aligned.sortedByCoord.out.bam # results/align/bam/CT_09_Aligned.sortedByCoord.out.bam # results/align/bam/myb_1_Aligned.sortedByCoord.out.bam # results/align/bam/myb_2_Aligned.sortedByCoord.out.bam # results/align/bam/myb_3_Aligned.sortedByCoord.out.bam # results/align/bam/T_26_Aligned.sortedByCoord.out.bam # results/align/bam/T_29_Aligned.sortedByCoord.out.bam
# Assigned 1782267 2149116 1346560 1857113 1860672 1646598 # 2177472 2026202 2228740 2147773 2215221 2262312 1926824 2294866
# Unassigned_Unmapped 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_Read_Type 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_Singleton 0 3 1 1 0 2 # 0 1 0 0 2 1 0 1
# Unassigned_MappingQuality 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_Chimera 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_FragmentLength 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_Duplicate 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_MultiMapping 128327 144273 119209 153962 174780 144870 # 138855 138661 141028 127180 154870 140478 117774 142431
# Unassigned_Secondary 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_NonSplit 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_NoFeatures 151678 195102 182434 213739 200924 192799 # 191260 179008 186799 166075 171115 173195 157256 182054
# Unassigned_Overlapping_Length 0 0 0 0 0 0 # 0 0 0 0 0 0 0 0
# Unassigned_Ambiguity 10703 12905 8101 12647 12128 10852 # 12899 11531 12289 13778 12647 13605 11158 13119
18 / 21

featureCounts output: count table

# # Program:featureCounts v2.0.1; Command:"featureCounts" "-s" "2" "-p" "-B" "-t" # "gene" "-g" "Name" "-T" "4" "-a" "data/ref/annot/ITAG4.1/ITAG4.1_gene_models.gff" # "-o" "results/count/counts.txt" # "results/align/bam/C6_myb2_Aligned.sortedByCoord.out.bam" # "results/align/bam/C6_myb3_Aligned.sortedByCoord.out.bam" # "results/align/bam/C6_myb4_Aligned.sortedByCoord.out.bam" # "results/align/bam/cnt_1_Aligned.sortedByCoord.out.bam" # "results/align/bam/cnt_2_Aligned.sortedByCoord.out.bam" # "results/align/bam/cnt_3_Aligned.sortedByCoord.out.bam" # "results/align/bam/CT_05_Aligned.sortedByCoord.out.bam" # "results/align/bam/CT_06_Aligned.sortedByCoord.out.bam" # "results/align/bam/CT_09_Aligned.sortedByCoord.out.bam" # "results/align/bam/myb_1_Aligned.sortedByCoord.out.bam" # "results/align/bam/myb_2_Aligned.sortedByCoord.out.bam" # "results/align/bam/myb_3_Aligned.sortedByCoord.out.bam" # "results/align/bam/T_26_Aligned.sortedByCoord.out.bam" # "results/align/bam/T_29_Aligned.sortedByCoord.out.bam"
# Geneid Chr Start End Strand Length # results/align/bam/C6_myb2_Aligned.sortedByCoord.out.bam # results/align/bam/C6_myb3_Aligned.sortedByCoord.out.bam # results/align/bam/C6_myb4_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_1_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_2_Aligned.sortedByCoord.out.bam # results/align/bam/cnt_3_Aligned.sortedByCoord.out.bam # results/align/bam/CT_05_Aligned.sortedByCoord.out.bam # results/align/bam/CT_06_Aligned.sortedByCoord.out.bam # results/align/bam/CT_09_Aligned.sortedByCoord.out.bam # results/align/bam/myb_1_Aligned.sortedByCoord.out.bam # results/align/bam/myb_2_Aligned.sortedByCoord.out.bam # results/align/bam/myb_3_Aligned.sortedByCoord.out.bam # results/align/bam/T_26_Aligned.sortedByCoord.out.bam # results/align/bam/T_29_Aligned.sortedByCoord.out.bam
# Solyc00g160260.1 SL4.0ch00 83863 84177 + 315 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g160270.1 SL4.0ch00 166754 167268 - 515 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500003.1 SL4.0ch00 311496 382066 - 70571 1 0 # 0 0 3 0 0 1 1 3 1 1 0 # 2
# Solyc00g500004.1 SL4.0ch00 417592 418482 + 891 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500005.1 SL4.0ch00 478389 478640 + 252 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500006.1 SL4.0ch00 481288 483182 - 1895 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500007.1 SL4.0ch00 497941 498614 + 674 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500008.1 SL4.0ch00 531008 531682 + 675 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500009.1 SL4.0ch00 531761 537711 + 5951 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500010.1 SL4.0ch00 544694 545489 + 796 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500011.1 SL4.0ch00 549361 586470 + 37110 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500013.1 SL4.0ch00 589339 596916 + 7578 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Solyc00g500014.1 SL4.0ch00 628120 644283 + 16164 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500015.1 SL4.0ch00 701546 719208 + 17663 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc02g010630.2 SL4.0ch00 900015 902268 + 2254 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc02g010600.2 SL4.0ch00 903917 904413 - 497 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500016.1 SL4.0ch00 914971 916182 - 1212 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500017.1 SL4.0ch00 994894 998273 + 3380 0 0 # 0 0 0 0 0 0 0 0 0 0 0 # 0
# Solyc00g500018.1 SL4.0ch00 1000060 1003182 - 3123 0 0 0 0 0 0 0 0 0 0 0 0 0 0
19 / 21

Preparing to run MultiQC for all our results so far

Because we switched FASTQ files after running FastQC, we first need to run FastQC again:

# Remove the old FastQC results:
rm -r results/QC_fastq/*
# Rerun FastQC:
fastq_dir=data/fastq_concat_subsample
fastq_dir=results/QC_fastq
scripts/qc/fastqc_dir.sh "$fastq_dir" "$fastqc_dir"
20 / 21

Running MultiQC for all our results so far

mkdir results/multiqc
module load python/3.6-conda5.2
source activate multiqc-env
multiqc results/ -o results/multiqc
ls results/multiqc
#> multiqc_report.html multiqc_data
21 / 21

Overview of the analysis pipeline

2 / 21
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
sToggle scribble toolbox
Esc Back to slideshow