Getting set up

Start an RStudio session at OSC

Show instructions

  • Fill out the form as shown here.

  • Now, you should see a box like this:

  • Your job should start running pretty soon, and when it’s ready the box should look like this:

  • Click Connect to RStudio Server at the bottom of the box, and an RStudio Server instance will open. You’re ready to go!

Create an RStudio Project

Show instructions
basedir <- "/fs/project/PAS0471/teach/misc/2021-02_rnaseq/"

# Get your user name (by running a shell command via the `system()` function:
me <- system("echo $USER", intern = TRUE)

# Build the path to the target dir:
proj_dir <- file.path(basedir, me)

# Create the Project:
library(usethis)
create_project(path = proj_dir)

Now, RStudio will reload with the newly created Project open.

If you get the pop-up below, click Don't Save (do this whenever you get that pop-up):

Copy the count data and the metadata files

Go the Terminal tab next to the R Console in the bottom left of the RStudio window:

cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER

cp ../master/data/meta/metadata.txt data/meta
cp ../master/results/count/gene_counts_all.txt results/count

Load the necessary packages

if(! "pacman" %in% installed.packages()) install.packages("pacman")

packages <- c("DESeq2",          # Differential expression analysis
              "tidyverse",       # Misc. data manipulation and plotting
              "here",            # Managing file paths
              "pheatmap",        # Heatmap plot
              "apeglm",          # For LFC shrinkage
              "knitr")           # For table printing in Markdown file
pacman::p_load(char = packages)

theme_set(theme_bw())  # Set ggplot theme

Input and output dirs and files

Input files:

  • Gene counts table – the file exactly as it was written by featureCounts.

  • Metadata table – so we can group our samples and make comparisons between these groups.

count_table_file <- here("results/count/gene_counts_all.txt")
metadata_file <- here("data/meta/metadata.txt")

Output directories – and we create them if they don’t already exist:

outdir <- here("results/DE/")
plotdir <- here("results/DE/fig/")

if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)

Load input data

Load the count table from featureCounts:

raw_counts <- read.table(count_table_file,
                         sep = "\t", header = TRUE, skip = 1)

Load the metadata information:

metadata <- read.table(metadata_file, header = TRUE)

kable(head(metadata))
SampleID Experiment GH_trial AMF Treatment
C6_myb2 CAPS CAPS_6 Ri Agrobacterium_noexp
C6_myb3 CAPS CAPS_6 Ri Agrobacterium_noexp
C6_myb4 CAPS CAPS_6 Ri Agrobacterium_noexp
CT_05 CAPS CAPS_6 Ri mock
CT_06 CAPS CAPS_6 Ri mock
CT_09 CAPS CAPS_6 Ri mock

The Treatment column currently has the values Agrobacterium_noexp, Agrobacterium_myb, and mock.

To shorten this a bit, we’ll get rid of “Agrobacterium_”:

metadata$Treatment <- sub("Agrobacterium_", "", metadata$Treatment)

unique(metadata$Treatment)
## [1] "noexp" "mock"  "myb"



Prepare the data

Change the column names, which are very long:

colnames(raw_counts)[7:8]
## [1] "X.fs.scratch.PAS1548.JamesSeq.CAPS.tomato.star_out.X6465_Benitez.PonceM_C6_myb2_V1N_1_S1_L001_R1_001.fastq.gzAligned.sortedByCoord.out.bam"
## [2] "X.fs.scratch.PAS1548.JamesSeq.CAPS.tomato.star_out.X6465_Benitez.PonceM_C6_myb2_V1N_1_S1_L002_R1_001.fastq.gzAligned.sortedByCoord.out.bam"
my_regex <- ".+PonceM_(.+)_V1N.+"
colnames(raw_counts) <- sub(my_regex, "\\1", colnames(raw_counts))

colnames(raw_counts)
##  [1] "Geneid"  "Chr"     "Start"   "End"     "Strand"  "Length"  "C6_myb2"
##  [8] "C6_myb2" "C6_myb3" "C6_myb3" "C6_myb4" "C6_myb4" "CT_05"   "CT_05"  
## [15] "CT_06"   "CT_06"   "CT_09"   "CT_09"   "T_26"    "T_26"    "T_29"   
## [22] "T_29"    "T_30"    "T_30"    "cnt_1"   "cnt_1"   "cnt_2"   "cnt_2"  
## [29] "cnt_3"   "cnt_3"   "myb_1"   "myb_1"   "myb_2"   "myb_2"   "myb_3"  
## [36] "myb_3"

Besides the counts, there are columns with metadata for each gene:

raw_counts[1:5, 1:8]
##             Geneid       Chr  Start    End Strand Length C6_myb2 C6_myb2.1
## 1 Solyc00g160260.1 SL4.0ch00  83863  84177      +    315       0         0
## 2 Solyc00g160270.1 SL4.0ch00 166754 167268      -    515       0         0
## 3 Solyc00g500003.1 SL4.0ch00 311496 382066      -  70571       1         4
## 4 Solyc00g500004.1 SL4.0ch00 417592 418482      +    891       0         0
## 5 Solyc00g500005.1 SL4.0ch00 478389 478640      +    252       0         0

Let’s remove those:

counts <- raw_counts[, 7:ncol(raw_counts)]
rownames(counts) <- raw_counts$Geneid

Replicate samples

In this table, there are two separate entries for each sample: each library was sequenced on two lanes. Recall that in our workflow, we had merged these FASTQ files prior to mapping, but here we are using the table based on the full dataset produced by Matthew.

So, we will go ahead and merge these replicates now, by simply summing the counts:

counts.t <- t(counts)
rownames(counts.t) <- names(raw_counts)[7:36]
sums <- rowsum(counts.t, group = rownames(counts.t))
counts <- t(sums)

(Alternatively, one could use the specialized function DESeq2::collapseReplicates() for this.)

Check sample IDs

For differential expression analysis, we will be using the popularDESeq2 R/Bioconductor package (paper, website).

We will load both the count table and the metadata into DESeq2. When doing so, DESeq2 assumes that sample IDs in both tables match and are provided in the same order. Let’s make sure this is indeed the case.

Sort both data frames alphabetically:

metadata <- metadata[order(metadata$SampleID), ]
counts <- counts[, order(colnames(counts))]

Check if names are the same:

metadata$SampleID
##  [1] "C6_myb2" "C6_myb3" "C6_myb4" "cnt_1"   "cnt_2"   "cnt_3"   "CT_05"  
##  [8] "CT_06"   "CT_09"   "myb_1"   "myb_2"   "myb_3"   "T_26"    "T_29"   
## [15] "T_30"
colnames(counts)
##  [1] "C6_myb2" "C6_myb3" "C6_myb4" "cnt_1"   "cnt_2"   "cnt_3"   "CT_05"  
##  [8] "CT_06"   "CT_09"   "myb_1"   "myb_2"   "myb_3"   "T_26"    "T_29"   
## [15] "T_30"
matching_names <- identical(metadata$SampleID, colnames(counts))
matching_names
## [1] TRUE
if(matching_names == FALSE) stop("Sample ID in metadata and count matrix do not match!")

Create the DESeq2 object

We will create the DESeq2 object using the function DESeqDataSetFromMatrix(), which we will provide with three pieces of information:

  • The count data with argument countData.
  • The metadata with argument colData.
  • The model design for the DE analysis – argument design.
    For now, we will specify ~1, which means “no design” – we will change this before the actual DE analysis.
dds_raw <- DESeqDataSetFromMatrix(countData = counts,
                                  colData = metadata,
                                  design = ~ 1)

Remove the Ri_noexp group

#dds_raw <- dds_raw[, -which(dds_raw@colData$group == "Ri_noexp")]
#dds_raw@colData$group <- droplevels(dds_raw@colData$group)

dds_raw <- dds_raw[, -which(dds_raw@colData$Treatment == "noexp")]
#dds_raw@colData$group <- droplevels(dds_raw@colData$group)



Explore the count data

What are number of rows and columns of the count matrix?

dim(counts)
## [1] 34688    15

How many genes have non-zero counts?

dim(counts[rowSums(counts) > 0, ])
## [1] 28145    15

How many genes have total counts of at least 10?

dim(counts[rowSums(counts) >= 10, ])
## [1] 24771    15

Histogram of gene counts

Let’s plot a histogram of gene counts:

theme_set(theme_bw())

summed_gene_counts <- data.frame(gene_count = rowSums(counts)) %>%
  rownames_to_column("gene_id")

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 10000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0,0))

Zoom in a bit:

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 1000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(limits = c(0, 200000), expand = c(0,0)) +
  theme(plot.margin = margin(0.5, 0.7, 0.5, 0.5, "cm"))

How are counts distributed across samples? That is, we would like a sum of counts for each column. To get this, we use the apply() function, which can apply a function (in our case sum()) to all columns (hence MARGIN = 2 – for rows, use 1) of our counts dataframe:

apply(X = counts, MARGIN = 2, FUN = sum)
##  C6_myb2  C6_myb3  C6_myb4    cnt_1    cnt_2    cnt_3    CT_05    CT_06 
## 18122406 21913320 13659702 18844494 18890045 16634169 22183005 20620412 
##    CT_09    myb_1    myb_2    myb_3     T_26     T_29     T_30 
## 22924853 22070790 22829207 23195471 19902716 23221528 15793928



Principal Component Analysis (PCA)

Run the PCA and prepare for plotting

First, we normalize the count data to have even sampling across samples (with respect to library size) and approximately even variance:

vsd <- varianceStabilizingTransformation(dds_raw, blind = TRUE)

Next, we run the PCA and retrieve the data to plot with ggplot2:

pcaData <- plotPCA(vsd,
                   ntop = 500,
                   intgroup = c("AMF", "Treatment"),
                   returnData = TRUE)

We extract the percentage of variance explained by different principal components, so we can later add this information to the plot:

percentVar <- round(100 * attr(pcaData, "percentVar"))
percentVar
## [1] 58 16

We create a plot title with the species name in italic using the somewhat bizarre expression() function:

plot_title <- expression("PCA of " * italic(Glycine ~ max) * " transcript count data")

Plot the PCA results

ggplot(pcaData,
       aes(x = PC1, y = PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 6) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

Plot again – with sample names

library(ggrepel)

pca_plot2 <- ggplot(pcaData,
              aes(PC1, PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 3) +
  geom_label_repel(aes(label = name)) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

print(pca_plot2)



DE analysis – full dataset

The design has two factors: AMF and Treatment. Rather than fit a multivariate model, we can start by merging the two into a single factor called group, and fit a univariate model with this factor.

dds_raw$group <- factor(paste(dds_raw$AMF, dds_raw$Treatment, sep = "_"))
table(dds_raw$group)
## 
## control_mock  control_myb      Ri_mock       Ri_myb 
##            3            3            3            3

We will set the “reference” level of the factor to be the double negative control (empty substrate, no Agrobacteria):

dds_raw$group <- relevel(dds_raw$group, ref = "control_mock")
dds_raw$group
##  [1] control_mock control_mock control_mock Ri_mock      Ri_mock     
##  [6] Ri_mock      control_myb  control_myb  control_myb  Ri_myb      
## [11] Ri_myb       Ri_myb      
## Levels: control_mock control_myb Ri_mock Ri_myb

Next, we set the analysis design:

design(dds_raw) <- formula(~ group)

And finally, we perform the differential expression analysis with the DEseq() function:

dds <- DESeq(dds_raw)

The DESeq() function above performs three steps consecutively:

  • estimateSizeFactors() – “Normalization” by library size and composition.

    Note that DESeq2 doesn’t actually normalize the counts in the sense that it produces a matrix with adjusted counts. Instead it uses raw counts and includes the size factors in the modeling.

    To learn more about gene count normalization, see this video and this page.

  • estimateDispersions() – Estimate gene-wise dispersion (variance in counts).

  • nbinomWaldTest(ddsObj) – Fit the negative binomial GLM and calculate Wald statistics, which is the test statistic underlying the p-value for whether a gene is differentially expressed.

These functions could also be called separately, which would be useful if you want to be able to change more defaults.

The results table

res <- results(dds)
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 -0.3323131 1.218149 -0.2728018 0.7850056 0.8764988
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA

By default, the results table prints statistics comparing the last level of the factor with the first level: that is, log-fold change and p-values describe differences between these two levels specifically. However, we can easily extract equivalent statistics for any pairwise comparison among our factor levels, which we will see later.

For now, we will explore what each column in this table means:

  • The baseMean column contains the mean expression level across all samples.

  • The log2FoldChange column contains the log2-fold change of gene counts between the compared levels, that is, it represents the effect size.

    A log2-fold change of 1 indicates that the expression in the reference level is two-fold lower than that of the other level, a log2-fold change of 2 indicates a four-fold difference, a log2-fold change of 3 indicates an eight-fold difference, and so on.

    Similarly, negative log2-fold values indicate a change in gene counts in the other direction: the reference level is higher than the other level.

  • The lfcSE column indicates the uncertainty in terms of the standard error (SE) of the log2-fold change estimate.

  • The stat column indicates the value for the Wald test’s test statistic.

  • The pvalue column reported the uncorrected p-value from the Wald test.

  • Because we are testing significance for many genes, we need to correct for multiple testing. DESeq2 uses the Benjamini-Hochberg False Discovery Rate (FDR) correction, and these values are reported in the column padj (i.e., adjusted p-value).

A summary of this information about each column can be seen by running the mcols() function:

mcols(res)
## DataFrame with 6 rows and 2 columns
##                        type            description
##                 <character>            <character>
## baseMean       intermediate mean of normalized c..
## log2FoldChange      results log2 fold change (ML..
## lfcSE               results standard error: grou..
## stat                results Wald statistic: grou..
## pvalue              results Wald test p-value: g..
## padj                results   BH adjusted p-values

NA values in the results table

Some values in the results table can be set to NA for one of the following reasons:

  • If a gene contains a sample with a count outlier, both the p-value and adjusted p-value will be set to NA. (DESeq2 performs outlier detection using Cook’s distance.)

  • If all samples have zero counts for a given gene, the baseMean column will be zero, and the log2-fold change estimates, p-value and adjusted p-value will all be set to NA.

  • DESeq2 also automatically filters genes with a low mean count in the sense that it does not include them in the multiple testing correction. Therefore, in such cases, the p-value will not be NA, but the adjusted p-value will be.

    Because we have very low power to detect differential expression for such low-count genes, it is beneficial to remove them prior to the multiple testing correction: that way, the correction becomes less severe for the remaining genes.

Let’s see how many genes have NA p-values:

# Number of genes with NA p-value:
sum(is.na(res$pvalue))
## [1] 6990
# As a proportion of the total number of genes in the test:
sum(is.na(res$pvalue)) / nrow(res)
## [1] 0.2015106

And NA adjusted p-values:

# Number of genes with NA p-value:
sum(is.na(res$padj))
## [1] 10709
# As a proportion of the total number of genes in the test:
sum(is.na(res$padj)) / nrow(res)
## [1] 0.3087235



DE analysis – contrast two custom groups

Using the resultsNames function, we can see which pairwise contrasts between different levels of the factor are available (though it is not displayed in a particularly readable fashion):

resultsNames(dds)
## [1] "Intercept"                         "group_control_myb_vs_control_mock"
## [3] "group_Ri_mock_vs_control_mock"     "group_Ri_myb_vs_control_mock"

Not all pairwise contrasts between the 5 levels in our group factor are available here: instead, control_mock, which we set as the reference level, is being compared with the other 3 levels. (However, we can make other pairwise comparisons, too.)

Above, we looked at the results for the last of these comparisons (group_Ri_myb_vs_control_mock, i.e. “Ri_myb” vs. “control_mock”), simply because DESeq2 will show the last comparison by default when calling the results() function.

To see the results table for one of the other 3 comparisons, we pass a vector to the contrast argument of the results() function with the factor (group) and the two levels to be contrasted. For example, to see the results for “Ri_mock” vs. “control_mock”:

# Here, we could specify *any* pairwise contrast,
# not just the ones with "control_mock" that resultsNames() prints as seen above.
my_contrast <- c("Ri_mock", "control_mock")

res <- results(dds,
               contrast = c("group", my_contrast))

How many adjusted p-values were less than 0.1?

sum(res$padj < 0.1, na.rm = TRUE)
## [1] 5217

We’ll also create an object with adjusted (shrunken) LFC estimates, which will be useful for visualization and ranking of genes by LFC:

my_coef <- paste0("group_", paste0(my_contrast, collapse = "_vs_"))
my_coef
## [1] "group_Ri_mock_vs_control_mock"
res_LFC <- lfcShrink(dds,
                     coef = my_coef,
                     type = "apeglm",
                     lfcThreshold = 1)

Here, we had to provide the contrast (“coefficient”) in the format given by resultsNames(dds). (And that looks a bit confusing because this format uses underscores to separate levels, while our factor levels themselves also contain underscores.)

We also specified a threshold of 1 for the LFC value (lfcThreshold = 1), so we get s-values (analogous to p-values) that test not for differential expression of any magnitude (as in the tests above), but whether the LFC is greater than our specified threshold:

head(res_LFC)
## log2 fold change (MAP): group Ri mock vs control mock 
##  
## DataFrame with 6 rows and 4 columns
##                   baseMean log2FoldChange     lfcSE    svalue
##                  <numeric>      <numeric> <numeric> <numeric>
## Solyc00g160260.1    0.0000             NA        NA        NA
## Solyc00g160270.1    0.0000             NA        NA        NA
## Solyc00g500003.1   11.1187      -0.250441  0.536045  0.576886
## Solyc00g500004.1    0.0000             NA        NA        NA
## Solyc00g500005.1    0.0000             NA        NA        NA
## Solyc00g500006.1    0.0000             NA        NA        NA
sum(res_LFC$svalue < 0.1, na.rm = TRUE)
## [1] 1441

This can be a useful way to try and tease out statistical from biological significance.



Visually exploring the results

We will create a few plots, by way of example, of the results for the “Ri_mock” versus “control_mock” comparison, which we extracted above.

MA-plot:

For a nice overview of the results, we can plot a so-called “MA plot”. An MA plot shows, for each gene:

  • Count differences in terms of LFC between two groups, on the y-axis.

  • Mean counts across both groups, on the x-axis.

We can create an MA plot using DESeq2’s plotMA function, with significantly differentially expressed genes displayed in blue:

plotMA(res, ylim = c(-5, 5))

To be able to customize the plot, we’ll use returnData = TRUE like we have done with previous plots, and then plot the resulting dataframe with ggplot2:

d <- plotMA(res, returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_continuous(limits = c(-10, 10)) +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "LFC")

We can see that lowly-expressed genes tend to deviate from an LFC of 0 (same mean expression levels in the two groups) much more than highly-expressed genes do. However, this is an artifact of noise overpowering the signal when expression values are low. We can also see that no genes in the far left part of the plot are differentially expressed: this is due to this same lack of power.

DESeq2 provides several methods to adjust LFC estimates for this low-expression bias. We used one of those (lfcShrink()) above to produce the res_LFC object. Let’s create another MA plot with these adjusted LFC estimates:

d <- plotMA(res_LFC, ylim = c(-5, 5), returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

Note that significance is now specified for LFC > 1 – FINISH

Finally, for a plot like this, it could be useful to be able to identify individual genes. There are way too many to print the gene names on the plot, though. Instead, we can also make the plot interactive with Plotly, so we can see the identity of each gene when we hover over the point:

library(plotly)

# To show the gene name, we need to have a column with gene names.
# Currently, the gene names are row names but ggplot2 (and other tidyverse
# applications) don't like that, so we create a column with gene names:
d$gene <- rownames(d)

# First we create a very similar ggplot to what we did above,
# but we assign gene names to "text":
p_ma <- ggplot(d,
               aes(x = mean, y = lfc, color = isDE, text = gene)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

# Finally, we make the plot interactive and tell Plotly that it should show
# the "text" (i.e. gene names) as the "tooltip", meaning upon hovering:
ggplotly(p_ma, tooltip = "text")

Plot specific genes

We can also create plot of expression levels for individual genes. That is especially interesting for genes with highly significant differential expression.

Let’s plot the top-5 most significantly differentially expressed genes:

# First, we select the 5 genes with the lowest adjusted p-value:
top5 <- row.names(res[order(res$padj)[1:5], ])
# Then we create a function to make a plot for a single gene:
plotgene <- function(geneID, dds) {
  
  d <- plotCounts(dds,
                  gene = geneID,
                  intgroup = "group",
                  returnData = TRUE)

  p <- ggplot(d, aes(x = group, y = count)) +
          geom_point(position = position_jitter(w = 0.1, h = 0)) +
          labs(title = geneID) +
          theme_bw()
  
  print(p)
}

Finally, we use sapply() to apply this function to each of our genes in the top5 vector.

none <- sapply(top5, plotgene, dds)

If we wanted to, we could easily create plots for 100s of genes, this way.

Heatmap

We can create heatmaps with the pheatmap function. Let’s start by creating a function that will plot a heatmap given a vector of gene IDs and a DESeq2 object dds:

plot_heatmap <- function(geneIDs, dds) {
  
  ntd <- assay(normTransform(dds))
  
  ntd_sel <- ntd[match(geneIDs, rownames(ntd)), ]
  df_meta <- as.data.frame(colData(dds)[, c("AMF", "Treatment")])
  
  pheatmap(ntd_sel,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         annotation_col = df_meta)
}

Now, we can easily create a heatmap for the top-20 most highly differentially expressed genes:

top20_DE <- row.names(res[order(res$padj)[1:20], ])
plot_heatmap(top20_DE, dds)

Or for the 20 most highly expressed genes:

top20_hi_idx <- order(rowMeans(counts(dds, normalized = TRUE)),
                  decreasing = TRUE)[1:20]
top20_hi <- row.names(dds)[top20_hi_idx]
plot_heatmap(top20_hi, dds)

Export the results

Let’s save the results dataframe to file.

Note that it will only contain the results for one comparison. Also, if we write the results dataframe to file, we won’t be able to tell from the file what the comparison is, so let’s store that in two columns:

my_contrast
## [1] "Ri_mock"      "control_mock"
my_contrast_pasted <- paste0(my_contrast, collapse = "_vs_")
my_contrast_pasted
## [1] "Ri_mock_vs_control_mock"
res$level1 <- my_contrast[1]
res$level2 <- my_contrast[2]
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj level1 level2
Solyc00g160260.1 0.00000 NA NA NA NA NA Ri_mock control_mock
Solyc00g160270.1 0.00000 NA NA NA NA NA Ri_mock control_mock
Solyc00g500003.1 11.11875 -1.917985 1.255117 -1.528132 0.1264797 0.3040457 Ri_mock control_mock
Solyc00g500004.1 0.00000 NA NA NA NA NA Ri_mock control_mock
Solyc00g500005.1 0.00000 NA NA NA NA NA Ri_mock control_mock
Solyc00g500006.1 0.00000 NA NA NA NA NA Ri_mock control_mock

Now we can write res to file:

res_file <- file.path(outdir, paste0(my_contrast_pasted, '_all-res.txt'))

write.table(res, res_file,
            sep = '\t', row.names = TRUE, quote = FALSE)

We may also want to save a separate table with only significant results:

res_sig_file <- file.path(outdir, paste0(my_contrast_pasted, '_sig-res.txt'))

res_sig <- res %>%
  as.data.frame() %>% 
  dplyr::filter(padj < 0.1)

write.table(res_sig, res_sig_file,
            sep = '\t', row.names = TRUE, quote = FALSE)



DE analysis – all pairwise comparisons

To run the DE analysis for all pairwise comparisons, we will start by writing a function that takes a comparison (contrast) in the form of c(level1, level2) along with a DESeq2 object (dds), and output a data frame with significantly differentially expressed genes:

sig_contrast <- function(my_contrast, dds) {
  
  res_sig <- results(dds,
                     contrast = c("group", my_contrast)) %>%
    as.data.frame() %>% 
    dplyr::filter(padj < 0.1) %>%
    mutate(level1 = my_contrast[1],
           level2 = my_contrast[2])
  
  cat(my_contrast[1], "versus", my_contrast[2], ":", nrow(res_sig), "significant\n")
  
  return(res_sig)
}

Then, we will create a list with all pairwise combinations of our group factor using the combn() function:

comps <- combn(levels(dds@colData$group), 2, simplify = FALSE)

Finally, we run the function for all of our pairwise comparisons, and using do.call(rbind, ...), we concatenate all the results in a single data frame:

sig_all_contrasts <- do.call(rbind, lapply(comps, sig_contrast, dds))
## control_mock versus control_myb : 8327 significant
## control_mock versus Ri_mock : 5217 significant
## control_mock versus Ri_myb : 6109 significant
## control_myb versus Ri_mock : 4223 significant
## control_myb versus Ri_myb : 2263 significant
## Ri_mock versus Ri_myb : 388 significant
# The `kable()` function is just to display tables nicely in this R Markdown
# document -- use `head(sig_all_contrasts)` in your own code.
kable(head(sig_all_contrasts))
baseMean log2FoldChange lfcSE stat pvalue padj level1 level2
61.886942 2.682127 0.5736071 4.675896 0.0000029 0.0001509 control_mock control_myb
3.043385 5.914974 2.2213307 2.662806 0.0077492 0.0347193 control_mock control_myb
2.158579 5.117064 1.8784057 2.724153 0.0064467 0.0305506 control_mock control_myb
4393.208985 1.725896 0.3936111 4.384775 0.0000116 0.0003991 control_mock control_myb
82.440348 1.095528 0.3450265 3.175200 0.0014973 0.0109290 control_mock control_myb
8.675927 1.984984 0.7470738 2.657012 0.0078837 0.0351581 control_mock control_myb

DE analysis – with two factors

Controlling for one factor

Say we wanted to analyze the effect of “mock” versus “myb” (“Treatment” column) while controlling for the effects of “control” versus “Ri” (“AMF” column).

Let’s start by turning “Treatment” and “AMF” into factors, and saving a new DESeq2 object:

# Convert Treatment and AMF into factors: 
dds_raw$Treatment <- relevel(factor(dds_raw$Treatment), ref = "mock")
dds_raw$AMF <- relevel(factor(dds_raw$AMF), ref = "control")

# Save a new object:
dds_2f_raw <- dds_raw

To include both factors, we use a + in the formula. Note that the order matters: using AMF + Treatment, we test for the effect of “Treatment” (the last factor), while controlling for the effect of AMF (the first factor).

design(dds_2f_raw) <- formula(~ AMF + Treatment)

Run DESeq with the new design:

dds_2f <- DESeq(dds_2f_raw)
res <- results(dds_2f)

How many adjusted p-values were less than 0.1?

sum(res$padj < 0.1, na.rm = TRUE)
## [1] 3719

With an interaction term

We can add an interaction term using notation like level1:level2, with a colon. In this case, we want to include both “AMF”, “Treatment”, and the interaction between the two:

# Save a new object:
dds_2fi_raw <- dds_raw

# The interaction term is `AMF:Treatment`:
design(dds_2fi_raw) <- formula(~ AMF + Treatment + AMF:Treatment)

Run DESeq with the new design:

dds_2fi <- DESeq(dds_2fi_raw)

resultsNames(dds_2fi)
## [1] "Intercept"             "AMF_Ri_vs_control"     "Treatment_myb_vs_mock"
## [4] "AMFRi.Treatmentmyb"

From the DESeq2 vignette: >The key point to remember about designs with interaction terms is that, unlike for a design ~ genotype + condition, where the condition effect represents the overall effect controlling for differences due to genotype, by adding genotype:condition, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level). The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype.

Now, we can check a number of different effects:

Does the effects of Treatment differ among levels of AMF?

This is described by the interaction term, which came last in our formula. Therefore, the results() function will get results for the interaction term by default:

res <- results(dds_2fi)
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 1.971865 1.747953 1.128099 0.2592781 0.5103138
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA

The effect of Treatment in “control” AMF:

res <- results(dds_2fi,
               contrast = c("Treatment", "myb", "mock"))
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 -0.3861962 1.215878 -0.3176274 0.7507676 0.8455709
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA

The effect of Treatment in “myb” AMF:

res <- results(dds_2fi,
               contrast = list(c("Treatment_myb_vs_mock", "AMFRi.Treatmentmyb")))
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 1.585668 1.255784 1.262692 0.2066998 0.9374357
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA

The effect of AMF in the “mock” Treatment:

res <- results(dds_2fi,
                contrast = c("AMF", "Ri", "control"))
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 -1.917982 1.255115 -1.528132 0.1264798 0.3040459
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA

The effect of AMF in “myb” Treatment:

res <- results(dds_2fi,
               contrast = list(c("AMF_Ri_vs_control", "AMFRi.Treatmentmyb")))
kable(head(res))
baseMean log2FoldChange lfcSE stat pvalue padj
Solyc00g160260.1 0.00000 NA NA NA NA NA
Solyc00g160270.1 0.00000 NA NA NA NA NA
Solyc00g500003.1 11.11875 0.0538826 1.216568 0.0442907 0.9646727 0.9859513
Solyc00g500004.1 0.00000 NA NA NA NA NA
Solyc00g500005.1 0.00000 NA NA NA NA NA
Solyc00g500006.1 0.00000 NA NA NA NA NA
---
title: "Differential expression (DE) analysis <br> with DESeq2"
author: "Jelmer Poelstra"
date: "3/26/2021"
output:
  html_document:
    code_download: true
    toc_depth: 3
    toc: true
    toc_float: true
    theme: cerulean
    highlight: tango
    anchor_sections: true
    df_print: kable
    css: html_page.css
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo=TRUE, eval=TRUE, cache=TRUE,
  warning=FALSE, message=FALSE,
  class.source = "r_code",
  class.output = "r_output",
  class.warning = "r_warning",
  class.message = "r_warning",
  class.error = "r_error"
  )
```

<br> <br>

----

## Getting set up

### Start an RStudio session at OSC

<details><summary>Show instructions</summary>

- Login to OSC at <https://ondemand.osc.edu>.

- Click on `Interactive Apps` (top bar) > `RStudio Server (Owens and Pitzer)`

<p align="center">
<img src=img/apps.png width="500">
</p>

- Fill out the form as shown [here](slides/03-OSC-slides.html#rstudio_server_job).

- Now, you should see a box like this:

<p align="center">
<img src=img/osc_queued.png width="700">
</p>

- Your job should start running pretty soon, and when it's ready the box should look like this: 

<p align="center">
<img src=img/osc_running.png width="700">
</p>

- Click `Connect to RStudio Server` at the bottom of the box, and an RStudio Server instance will open. You're ready to go!

</details>

### Create an RStudio Project

<details><summary>Show instructions</summary>

```{r, eval=FALSE}
basedir <- "/fs/project/PAS0471/teach/misc/2021-02_rnaseq/"

# Get your user name (by running a shell command via the `system()` function:
me <- system("echo $USER", intern = TRUE)

# Build the path to the target dir:
proj_dir <- file.path(basedir, me)

# Create the Project:
library(usethis)
create_project(path = proj_dir)
```

Now, RStudio will reload with the newly created Project open.

If you get the pop-up below, click `Don't Save` (do this whenever you get that pop-up):

<p align="center">
<img src=img/rdata-popup.png width="350">
</p>

</details>

### Copy the count data and the metadata files

Go the `Terminal` tab next to the R `Console` in the bottom left of the RStudio
window:

```{sh, eval=FALSE}
cd /fs/project/PAS0471/teach/misc/2021-02_rnaseq/$USER

cp ../master/data/meta/metadata.txt data/meta
cp ../master/results/count/gene_counts_all.txt results/count
```

### Load the necessary packages

```{r, warning=FALSE, message=FALSE}
if(! "pacman" %in% installed.packages()) install.packages("pacman")

packages <- c("DESeq2",          # Differential expression analysis
              "tidyverse",       # Misc. data manipulation and plotting
              "here",            # Managing file paths
              "pheatmap",        # Heatmap plot
              "apeglm",          # For LFC shrinkage
              "knitr")           # For table printing in Markdown file
pacman::p_load(char = packages)

theme_set(theme_bw())  # Set ggplot theme
```

### Input and output dirs and files

```{r, eval=TRUE, echo=FALSE}
count_table_file <- "results/count/gene_counts_all.txt"
metadata_file <- "data/meta/metadata.txt"

outdir <- "results/DE/"
plotdir <- "results/DE/fig/"
```

Input files:

- **Gene counts table** -- the file exactly as it was written by *featureCounts*.
  
- **Metadata table** -- so we can group our samples and make comparisons between
  these groups.

```{r, eval=FALSE}
count_table_file <- here("results/count/gene_counts_all.txt")
metadata_file <- here("data/meta/metadata.txt")
```

Output directories -- and we create them if they don't already exist:

```{r}
outdir <- here("results/DE/")
plotdir <- here("results/DE/fig/")

if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
if (!dir.exists(plotdir)) dir.create(plotdir, recursive = TRUE)
```

### Load input data

Load the count table from *featureCounts*:

```{r, eval=TRUE}
raw_counts <- read.table(count_table_file,
                         sep = "\t", header = TRUE, skip = 1)
```

Load the metadata information:

```{r}
metadata <- read.table(metadata_file, header = TRUE)

kable(head(metadata))
```

The `Treatment` column currently has the values `Agrobacterium_noexp`,
`Agrobacterium_myb`, and `mock`.

To shorten this a bit, we'll get rid of "Agrobacterium_":

```{r}
metadata$Treatment <- sub("Agrobacterium_", "", metadata$Treatment)

unique(metadata$Treatment)
```


<br>

----

## Prepare the data

Change the column names, which are very long:

```{r}
colnames(raw_counts)[7:8]
```

```{r}
my_regex <- ".+PonceM_(.+)_V1N.+"
colnames(raw_counts) <- sub(my_regex, "\\1", colnames(raw_counts))

colnames(raw_counts)
```

Besides the counts, there are columns with metadata for each gene:

```{r}
raw_counts[1:5, 1:8]
```

Let's remove those:

```{r}
counts <- raw_counts[, 7:ncol(raw_counts)]
rownames(counts) <- raw_counts$Geneid
```

### Replicate samples

In this table, there are two separate entries for each sample:
each library was sequenced on two lanes.
Recall that in our workflow, we had merged these FASTQ files prior to mapping,
but here we are using the table based on the full dataset produced by Matthew.

So, we will go ahead and merge these replicates now, by simply summing the counts:

```{r}
counts.t <- t(counts)
rownames(counts.t) <- names(raw_counts)[7:36]
sums <- rowsum(counts.t, group = rownames(counts.t))
counts <- t(sums)
```

(Alternatively, one could use the specialized function
`DESeq2::collapseReplicates()` for this.)

### Check sample IDs

For differential expression analysis, we will be using the popular**DESeq2**
R/Bioconductor package
([paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8),
[website](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)).

We will load both the count table and the metadata into *DESeq2*.
When doing so, *DESeq2* assumes that sample IDs in both tables match and 
are provided in the same order. Let's make sure this is indeed the case.

Sort both data frames alphabetically:

```{r}
metadata <- metadata[order(metadata$SampleID), ]
counts <- counts[, order(colnames(counts))]
```

Check if names are the same:

```{r}
metadata$SampleID

colnames(counts)

matching_names <- identical(metadata$SampleID, colnames(counts))
matching_names
if(matching_names == FALSE) stop("Sample ID in metadata and count matrix do not match!")
```

### Create the DESeq2 object

We will create the DESeq2 object using the function `DESeqDataSetFromMatrix()`,
which we will provide with three pieces of information:

- The count data with argument `countData`.
- The metadata with argument `colData`.
- The model design for the DE analysis -- argument `design`.  
  For now, we will specify `~1`, which means "no design" --
  we will change this before the actual DE analysis.

```{r}
dds_raw <- DESeqDataSetFromMatrix(countData = counts,
                                  colData = metadata,
                                  design = ~ 1)
```

### Remove the `Ri_noexp` group

```{r}
#dds_raw <- dds_raw[, -which(dds_raw@colData$group == "Ri_noexp")]
#dds_raw@colData$group <- droplevels(dds_raw@colData$group)

dds_raw <- dds_raw[, -which(dds_raw@colData$Treatment == "noexp")]
#dds_raw@colData$group <- droplevels(dds_raw@colData$group)
```

<br>

----

## Explore the count data

What are number of rows and columns of the count matrix?

```{r}
dim(counts)
```

How many genes have non-zero counts?

```{r}
dim(counts[rowSums(counts) > 0, ])

```

How many genes have total counts of at least 10?

```{r}
dim(counts[rowSums(counts) >= 10, ])
```

### Histogram of gene counts

Let's plot a histogram of gene counts:

```{r}
theme_set(theme_bw())

summed_gene_counts <- data.frame(gene_count = rowSums(counts)) %>%
  rownames_to_column("gene_id")

ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 10000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(expand = c(0,0))
```

Zoom in a bit:

```{r}
ggplot(data = summed_gene_counts) +
  geom_histogram(aes(x = gene_count), binwidth = 1000) +
  scale_y_log10(expand = c(0, 0)) +
  scale_x_continuous(limits = c(0, 200000), expand = c(0,0)) +
  theme(plot.margin = margin(0.5, 0.7, 0.5, 0.5, "cm"))
```

How are counts distributed across samples?
That is, we would like a sum of counts for each column.
To get this, we use the `apply()` function, which can apply a function
(in our case `sum()`) to all columns (hence `MARGIN = 2` -- for rows, use `1`)
of our `counts` dataframe:

```{r}
apply(X = counts, MARGIN = 2, FUN = sum)
```

<br>

----

## Principal Component Analysis (PCA) 

### Run the PCA and prepare for plotting

First, we normalize the count data to have even sampling across samples
(with respect to library size) and approximately even variance:

```{r}
vsd <- varianceStabilizingTransformation(dds_raw, blind = TRUE)
```

Next, we run the PCA and retrieve the data to plot with *ggplot2*:

```{r}
pcaData <- plotPCA(vsd,
                   ntop = 500,
                   intgroup = c("AMF", "Treatment"),
                   returnData = TRUE)
```

We extract the percentage of variance explained by different principal components,
so we can later add this information to the plot:

```{r}
percentVar <- round(100 * attr(pcaData, "percentVar"))
percentVar
```

We create a plot title with the species name in italic using the
somewhat bizarre `expression()` function: 

```{r}
plot_title <- expression("PCA of " * italic(Glycine ~ max) * " transcript count data")
```

### Plot the PCA results

```{r}
ggplot(pcaData,
       aes(x = PC1, y = PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 6) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)
```

### Plot again -- with sample names

```{r}
library(ggrepel)

pca_plot2 <- ggplot(pcaData,
              aes(PC1, PC2, color = AMF, shape = Treatment)) +
  geom_point(size = 3) +
  geom_label_repel(aes(label = name)) +
  xlab(paste0("PC1: ", percentVar[1], "% of variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% of variance")) +
  ggtitle(plot_title)

print(pca_plot2)
```

<br>

----

## DE analysis -- full dataset

The design has two factors: `AMF` and `Treatment`.
Rather than fit a multivariate model, we can start by merging the two into a
single factor called `group`, and fit a univariate model with this factor.

```{r}
dds_raw$group <- factor(paste(dds_raw$AMF, dds_raw$Treatment, sep = "_"))
table(dds_raw$group)
```

We will set the "reference" level of the factor to be the double negative control
(empty substrate, no Agrobacteria):

```{r}
dds_raw$group <- relevel(dds_raw$group, ref = "control_mock")
dds_raw$group
```

Next, we set the analysis design:

```{r}
design(dds_raw) <- formula(~ group)
```

And finally, we perform the differential expression analysis with the `DEseq()`
function:

```{r}
dds <- DESeq(dds_raw)
```

The `DESeq()` function above performs three steps consecutively:
  
- `estimateSizeFactors()` -- "Normalization" by library size and composition.
  
  Note that *DESeq2* doesn’t actually normalize the counts in the sense that
  it produces a matrix with adjusted counts.
  Instead it uses raw counts and includes the size factors in the modeling.
  
  To learn more about gene count normalization, see
  [this video](https://www.youtube.com/watch?v=UFB993xufUU) and
  [this page](https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html).
  
- `estimateDispersions()` -- Estimate gene-wise dispersion (variance in counts).
  
- `nbinomWaldTest(ddsObj)` -- Fit the negative binomial GLM and calculate
  Wald statistics, which is the test statistic underlying the p-value
  for whether a gene is differentially expressed.

These functions could also be called separately,
which would be useful if you want to be able to change more defaults.

### The results table

```{r}
res <- results(dds)
kable(head(res))
```

By default, the results table prints statistics
**comparing the last level of the factor with the first level**:
that is, log-fold change and p-values describe
differences between these two levels specifically.
However, we can easily extract equivalent statistics for any pairwise comparison
among our factor levels, which we will see later.

For now, we will explore what each column in this table means:

- The **`baseMean`** column contains the mean expression level across all samples.

- The **`log2FoldChange`** column contains the log2-fold change of gene counts
  between the compared levels, that is, it represents the *effect size*.
  
  A log2-fold change of 1 indicates that the expression in the reference level
  is two-fold *lower* than that of the other level,
  a log2-fold change of 2 indicates a four-fold difference,
  a log2-fold change of 3 indicates an eight-fold difference, and so on.
  
  Similarly, *negative log2-fold* values indicate a change in gene counts in the
  other direction: the reference level is *higher* than the other level.
  
- The `lfcSE` column indicates the uncertainty in terms of the standard error
  (SE) of the log2-fold change estimate.
  
- The **`stat`** column indicates the value for the Wald test's test statistic.

- The **`pvalue`** column reported the *uncorrected* p-value from the Wald test.

- Because we are testing significance for *many* genes,
  we need to correct for multiple testing.
  DESeq2 uses the Benjamini-Hochberg False Discovery Rate (FDR) correction,
  and these values are reported in the column **`padj`** (i.e., adjusted p-value).

A summary of this information about each column can be seen by running the
`mcols()` function:

```{r}
mcols(res)
```

### `NA` values in the results table

Some values in the results table can be set to `NA` for one of the following reasons:

- If a gene contains a sample with a count **outlier**,
  both the p-value and adjusted p-value will be set to `NA`.
  (DESeq2 performs outlier detection using Cook's distance.)
  
- If all samples have **zero counts** for a given gene,
  the `baseMean` column will be zero,
  and the log2-fold change estimates,
  p-value and adjusted p-value will all be set to `NA`.

- DESeq2 also automatically filters genes with a **low mean count**
  in the sense that it does not include them in the multiple testing correction.
  Therefore, in such cases, the p-value will not be `NA`,
  but the *adjusted* p-value will be.
  
  Because we have very low power to detect differential expression for such
  low-count genes, it is beneficial to remove them prior to the multiple testing
  correction: that way, the correction becomes less severe for the remaining genes.

Let's see how many genes have `NA` p-values:

```{r}
# Number of genes with NA p-value:
sum(is.na(res$pvalue))

# As a proportion of the total number of genes in the test:
sum(is.na(res$pvalue)) / nrow(res)
```

And `NA` adjusted p-values:

```{r}
# Number of genes with NA p-value:
sum(is.na(res$padj))

# As a proportion of the total number of genes in the test:
sum(is.na(res$padj)) / nrow(res)
```

<br>

----

## DE analysis -- contrast two custom groups

Using the `resultsNames` function,
we can see which pairwise contrasts between different levels of the factor are
available (though it is not displayed in a particularly readable fashion):

```{r}
resultsNames(dds)
```

Not all pairwise contrasts between the 5 levels in our `group` factor are available
here: instead, `control_mock`, which we set as the reference level, is being compared
with the other 3 levels.
(However, we can make other pairwise comparisons, too.)

Above, we looked at the results for the *last* of these comparisons
(`group_Ri_myb_vs_control_mock`, i.e. "Ri_myb" vs. "control_mock"),
simply because DESeq2 will show the last comparison by default when calling
the `results()` function.

To see the results table for one of the other 3 comparisons,
we pass a vector to the `contrast` argument of the `results()` function with
the factor (`group`) and the two levels to be contrasted.
For example, to see the results for "Ri_mock" vs. "control_mock":

```{r}
# Here, we could specify *any* pairwise contrast,
# not just the ones with "control_mock" that resultsNames() prints as seen above.
my_contrast <- c("Ri_mock", "control_mock")

res <- results(dds,
               contrast = c("group", my_contrast))
```

How many adjusted p-values were less than 0.1?

```{r}
sum(res$padj < 0.1, na.rm = TRUE)
```

We'll also create an object with *adjusted (shrunken) LFC estimates*,
which will be useful for visualization and ranking of genes by LFC:

```{r}
my_coef <- paste0("group_", paste0(my_contrast, collapse = "_vs_"))
my_coef

res_LFC <- lfcShrink(dds,
                     coef = my_coef,
                     type = "apeglm",
                     lfcThreshold = 1)
```

Here, we had to provide the contrast ("coefficient") in the format
given by `resultsNames(dds)`.
(And that looks a bit confusing because this format uses underscores to separate levels,
while our factor levels themselves also contain underscores.)

We also specified a threshold of 1 for the LFC value (`lfcThreshold = 1`),
so we get s-values (analogous to p-values) that test not for differential
expression of any magnitude (as in the tests above),
but whether the LFC is greater than our specified threshold: 

```{r}
head(res_LFC)

sum(res_LFC$svalue < 0.1, na.rm = TRUE)
```

This can be a useful way to try and tease out statistical from biological significance.

<br>

----

## Visually exploring the results

We will create a few plots, by way of example,
of the results for the "Ri_mock" versus "control_mock" comparison,
which we extracted above.

### MA-plot:

For a nice overview of the results, we can plot a so-called "MA plot".
An MA plot shows, for each gene:

- Count differences in terms of LFC between two groups, on the y-axis.

- Mean counts across both groups, on the x-axis.

We can create an MA plot using DESeq2's `plotMA` function,
with significantly differentially expressed genes displayed in blue:

```{r}
plotMA(res, ylim = c(-5, 5))
```

To be able to customize the plot,
we'll use `returnData = TRUE` like we have done with previous plots,
and then plot the resulting dataframe with *ggplot2*:

```{r}
d <- plotMA(res, returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_y_continuous(limits = c(-10, 10)) +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "LFC")
```

We can see that lowly-expressed genes tend to deviate from an LFC of 0
(same mean expression levels in the two groups) much more than
highly-expressed genes do.
However, this is an artifact of noise overpowering the signal when
expression values are low.
We can also see that no genes in the far left part of the plot are
differentially expressed: this is due to this same lack of power.

DESeq2 provides several methods to adjust LFC estimates for this low-expression
bias.
We used one of those (`lfcShrink()`) above to produce the `res_LFC` object.
Let's create another MA plot with these adjusted LFC estimates:

```{r}
d <- plotMA(res_LFC, ylim = c(-5, 5), returnData = TRUE)

ggplot(d, aes(x = mean, y = lfc, color = isDE)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")
```

Note that significance is now specified for LFC > 1 -- FINISH

Finally, for a plot like this,
it could be useful to be able to identify individual genes.
There are way too many to print the gene names on the plot, though.
Instead, we can also make the plot **interactive with Plotly**,
so we can see the identity of each gene *when we hover over the point*:

```{r}
library(plotly)

# To show the gene name, we need to have a column with gene names.
# Currently, the gene names are row names but ggplot2 (and other tidyverse
# applications) don't like that, so we create a column with gene names:
d$gene <- rownames(d)

# First we create a very similar ggplot to what we did above,
# but we assign gene names to "text":
p_ma <- ggplot(d,
               aes(x = mean, y = lfc, color = isDE, text = gene)) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  scale_color_manual(values = c('grey50', 'blue')) +
  guides(color = FALSE) +
  labs(x = "Mean of normalized counts",
       y = "Shrunken LFC")

# Finally, we make the plot interactive and tell Plotly that it should show
# the "text" (i.e. gene names) as the "tooltip", meaning upon hovering:
ggplotly(p_ma, tooltip = "text")
```

### Plot specific genes

We can also create plot of expression levels for individual genes.
That is especially interesting for genes with highly significant differential
expression.

Let's plot the top-5 most significantly differentially expressed genes:

```{r, eval=TRUE}
# First, we select the 5 genes with the lowest adjusted p-value:
top5 <- row.names(res[order(res$padj)[1:5], ])
```

```{r, eval=TRUE}
# Then we create a function to make a plot for a single gene:
plotgene <- function(geneID, dds) {
  
  d <- plotCounts(dds,
                  gene = geneID,
                  intgroup = "group",
                  returnData = TRUE)

  p <- ggplot(d, aes(x = group, y = count)) +
          geom_point(position = position_jitter(w = 0.1, h = 0)) +
          labs(title = geneID) +
          theme_bw()
  
  print(p)
}
```

Finally, we use `sapply()` to apply this function to each of our genes in the
`top5` vector.

```{r}
none <- sapply(top5, plotgene, dds)
```

If we wanted to, we could easily create plots for 100s of genes, this way.

### Heatmap

We can create heatmaps with the `pheatmap` function.
Let's start by creating a function that will plot a heatmap given a vector
of gene IDs and a DESeq2 object `dds`:

```{r, eval=TRUE}
plot_heatmap <- function(geneIDs, dds) {
  
  ntd <- assay(normTransform(dds))
  
  ntd_sel <- ntd[match(geneIDs, rownames(ntd)), ]
  df_meta <- as.data.frame(colData(dds)[, c("AMF", "Treatment")])
  
  pheatmap(ntd_sel,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         annotation_col = df_meta)
}
```

Now, we can easily create a heatmap for the top-20 most highly differentially
expressed genes:

```{r, eval=TRUE}
top20_DE <- row.names(res[order(res$padj)[1:20], ])
plot_heatmap(top20_DE, dds)
```

Or for the 20 most highly expressed genes:

```{r, eval=TRUE}
top20_hi_idx <- order(rowMeans(counts(dds, normalized = TRUE)),
                  decreasing = TRUE)[1:20]
top20_hi <- row.names(dds)[top20_hi_idx]
plot_heatmap(top20_hi, dds)
```

### Export the results

Let's save the results dataframe to file.

Note that it will only contain the results for one comparison.
Also, if we write the results dataframe to file,
we won't be able to tell from the file what the comparison is, so let's store
that in two columns:

```{r}
my_contrast

my_contrast_pasted <- paste0(my_contrast, collapse = "_vs_")
my_contrast_pasted
```

```{r}
res$level1 <- my_contrast[1]
res$level2 <- my_contrast[2]
kable(head(res))
```

Now we can write `res` to file:

```{r, eval=FALSE}
res_file <- file.path(outdir, paste0(my_contrast_pasted, '_all-res.txt'))

write.table(res, res_file,
            sep = '\t', row.names = TRUE, quote = FALSE)
```

We may also want to save a separate table with only significant results:

```{r, eval=FALSE}
res_sig_file <- file.path(outdir, paste0(my_contrast_pasted, '_sig-res.txt'))

res_sig <- res %>%
  as.data.frame() %>% 
  dplyr::filter(padj < 0.1)

write.table(res_sig, res_sig_file,
            sep = '\t', row.names = TRUE, quote = FALSE)
```

<br>

----

## DE analysis -- all pairwise comparisons

To run the DE analysis for all pairwise comparisons,
we will start by writing a function that takes a comparison (contrast) in the
form of `c(level1, level2)` along with a DESeq2 object (`dds`),
and output a data frame with significantly differentially expressed genes:

```{r}
sig_contrast <- function(my_contrast, dds) {
  
  res_sig <- results(dds,
                     contrast = c("group", my_contrast)) %>%
    as.data.frame() %>% 
    dplyr::filter(padj < 0.1) %>%
    mutate(level1 = my_contrast[1],
           level2 = my_contrast[2])
  
  cat(my_contrast[1], "versus", my_contrast[2], ":", nrow(res_sig), "significant\n")
  
  return(res_sig)
}
```

Then, we will create a list with all pairwise combinations of our `group` factor
using the `combn()` function:

```{r}
comps <- combn(levels(dds@colData$group), 2, simplify = FALSE)
```

Finally, we run the function for all of our pairwise comparisons,
and using `do.call(rbind, ...)`, we concatenate all the results in a single
data frame:

```{r}
sig_all_contrasts <- do.call(rbind, lapply(comps, sig_contrast, dds))

# The `kable()` function is just to display tables nicely in this R Markdown
# document -- use `head(sig_all_contrasts)` in your own code.
kable(head(sig_all_contrasts))
```

----

## DE analysis -- with two factors

### Controlling for one factor

Say we wanted to analyze the effect of "mock" versus "myb"
("Treatment" column) while controlling for the effects of "control" versus "Ri"
("AMF" column).

Let's start by turning "Treatment" and "AMF" into factors,
and saving a new DESeq2 object:

```{r}
# Convert Treatment and AMF into factors: 
dds_raw$Treatment <- relevel(factor(dds_raw$Treatment), ref = "mock")
dds_raw$AMF <- relevel(factor(dds_raw$AMF), ref = "control")

# Save a new object:
dds_2f_raw <- dds_raw
```

To include both factors, we use a `+` in the formula.
Note that the order matters:
using `AMF + Treatment`, we test for the effect of "Treatment" (the last factor),
while controlling for the effect of AMF (the first factor).

```{r}
design(dds_2f_raw) <- formula(~ AMF + Treatment)
```

Run DESeq with the new design:

```{r}
dds_2f <- DESeq(dds_2f_raw)
res <- results(dds_2f)
```

How many adjusted p-values were less than 0.1?

```{r}
sum(res$padj < 0.1, na.rm = TRUE)
```

### With an interaction term

We can add an interaction term using notation like `level1:level2`, with a colon.
In this case, we want to include both "AMF", "Treatment", and the interaction
between the two:

```{r}
# Save a new object:
dds_2fi_raw <- dds_raw

# The interaction term is `AMF:Treatment`:
design(dds_2fi_raw) <- formula(~ AMF + Treatment + AMF:Treatment)
```

Run DESeq with the new design:

```{r}
dds_2fi <- DESeq(dds_2fi_raw)

resultsNames(dds_2fi)
```

From the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#interactions):
>The key point to remember about designs with interaction terms is that, unlike for a design `~ genotype + condition`, where the condition effect represents the overall effect controlling for differences due to genotype, by adding `genotype:condition`, the main condition effect only represents the effect of condition for the reference level of genotype (I, or whichever level was defined by the user as the reference level). The interaction terms genotypeII.conditionB and genotypeIII.conditionB give the difference between the condition effect for a given genotype and the condition effect for the reference genotype.

Now, we can check a number of different effects:

#### Does the effects of Treatment differ among levels of AMF?

This is described by the interaction term, which came last in our formula.
Therefore, the `results()` function will get results for the interaction term by default:

```{r}
res <- results(dds_2fi)
kable(head(res))
```

#### The effect of Treatment in "control" AMF: 

```{r}
res <- results(dds_2fi,
               contrast = c("Treatment", "myb", "mock"))
kable(head(res))
```

#### The effect of Treatment in "myb" AMF: 

```{r}
res <- results(dds_2fi,
               contrast = list(c("Treatment_myb_vs_mock", "AMFRi.Treatmentmyb")))
kable(head(res))
```

#### The effect of AMF in the "mock" Treatment:

```{r}
res <- results(dds_2fi,
                contrast = c("AMF", "Ri", "control"))
kable(head(res))
```

#### The effect of AMF in "myb" Treatment:

```{r}
res <- results(dds_2fi,
               contrast = list(c("AMF_Ri_vs_control", "AMFRi.Treatmentmyb")))
kable(head(res))
```
