The Payoff: Finding Meaning with Differential Expression Analysis

Analog DreamerAnalog Dreamer
4 min read

This is the moment we've been working towards. We have journeyed from raw, messy sequencing reads, through meticulous quality control, alignment, and quantification. The result of that journey is a clean, simple counts matrix: a table where rows are genes, columns are our samples, and the values are the expression level.

Now, we can finally ask the core biological question that motivated the experiment: Which genes show a statistically significant change in expression between our conditions?

Why Not Just a T-test?

It's tempting to just go down the rows of our counts matrix and run a simple statistical test, like a t-test, to compare the "control" samples to the "treatment" samples for each gene. However, this is not appropriate for RNA-Seq data for several key reasons:

  1. RNA-Seq data is counts, not continuous values. This kind of data doesn't follow the normal distribution (the "bell curve") that a t-test assumes. Instead, it's better modeled by other distributions, like the Negative Binomial.

  2. We usually have few replicates. Most experiments only have 3-5 replicates per condition. With so few data points, it's impossible to get a reliable estimate of the variance for each individual gene.

  3. The mean and variance are linked. In RNA-Seq data, genes with higher expression tend to be more variable. A simple t-test doesn't account for this relationship.

Our Tool of Choice: DESeq2

To solve these problems, we use specialized tools. DESeq2, an R package, is one of the most popular and robust tools for this task. It is specifically designed to handle the statistical challenges of RNA-Seq data.

In simple terms, here's how DESeq2 works its magic:

  • It uses the Negative Binomial distribution to model the count data correctly.

  • Crucially, it shares information across all genes to get more stable and reliable estimates of variance. It "borrows strength" from genes with similar expression levels to improve its statistical power, which is essential when working with few replicates.

  • It performs rigorous statistical testing and corrects for the fact that we're testing thousands of genes simultaneously (multiple testing correction).

Inputs and Outputs

  • Inputs:

    1. The counts matrix from featureCounts.

    2. A metadata table (or "sample sheet"). This is a simple text file that links each sample to its experimental condition (e.g., sampleA, control, sampleB, treatment).

    3. A design formula (e.g., ~ condition). This tells DESeq2 the main factor we want to test.

  • Main Output: A results table. This table is our key result, containing a row for each gene and several important columns, including:

    • log2FoldChange: How much the gene's expression changed (on a log-2 scale). A value of 1 means a 2-fold increase; -2 means a 4-fold decrease.

    • pvalue: The raw p-value from the statistical test.

    • padj: The adjusted p-value. This is the most important column. It is corrected for multiple testing, and we typically use a threshold here (e.g., padj < 0.05) to get our final list of significantly differentially expressed genes.

The Snakemake Connection

We can integrate this R-based analysis seamlessly into our workflow using Snakemake's script directive. This allows us to keep our R code separate and clean while still having Snakemake manage its execution.

1. The R Script (scripts/deseq2.R)

# scripts/deseq2.R

# Load the library
library("DESeq2")

# Read the arguments from the Snakefile
counts_file <- snakemake@input[["counts"]]
samples_file <- snakemake@input[["samples"]]
output_file <- snakemake@output[[1]]

# Read the data
count_data <- as.matrix(read.csv(counts_file, row.names="gene_id"))
col_data <- read.csv(samples_file, row.names="sample")

# Create the DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = col_data,
                              design = ~ condition)

# Run the analysis
dds <- DESeq(dds)
res <- results(dds)

# Write the results to the output file
write.csv(as.data.frame(res), file=output_file)

2. The Snakemake Rule

# In our Snakefile

rule deseq2:
    input:
        counts = "results/quantification/counts.tsv",
        samples = config["samples_metadata"] # Path to the metadata file from config
    output:
        "results/de_analysis/treatment_vs_control.csv"
    log:
        "logs/deseq2.log"
    script:
        "scripts/deseq2.R"

This powerful combination lets us use the best tool for each job—R for statistics, Python/Snakemake for workflow management—in a single, reproducible pipeline.

Conclusion and Next Steps

We've done it! From raw reads to a final, interpretable list of differentially expressed genes. This list can be used to generate hypotheses, explore biological pathways, and drive new research.

This concludes our core tutorial on building an RNA-Seq pipeline. We've seen the "why" and "how" of each major step. But there's still more to learn. In our upcoming posts, we'll explore advanced Snakemake features for making our workflow even more robust, and we'll look to the future by seeing how AI tools like DSPy can help us interpret our results.


0
Subscribe to my newsletter

Read articles from Analog Dreamer directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Analog Dreamer
Analog Dreamer