Counting the Evidence: Quantifying Gene Expression

Analog Dreamer

In our last post, we successfully aligned our clean reads to the genome, generating BAM files that tell us the genomic origin of each read. We're now one step away from the exciting biological questions. The next task is to convert this mapping information into a simple number for each gene: a count of how many reads belong to it. This process is called quantification.

The resulting table of counts, known as a counts matrix, will be our proxy for gene expression and the direct input for our final analysis.

The Concept: From Alignments to Counts

The basic idea is straightforward. We have two key pieces of information:

  1. A list of all known genes and their exon coordinates (from our GTF annotation file).

  2. The coordinates of all our mapped reads (from our BAM alignment files).

The goal is to overlay these two sets of information and count, for each gene, how many reads map within its exons.
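To make the idea concrete, here is a purely conceptual Python sketch of that overlap-counting logic. The gene exons and read positions below are made up for illustration, and real counting tools also handle strandedness, multi-mapping reads, and reads that span several features; this only shows the core interval test.

# A toy sketch of the counting idea (illustrative only, not how real counters are implemented)

# Hypothetical exon coordinates per gene: (chromosome, start, end)
gene_exons = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 1000, 1200)],
}

# Hypothetical mapped read intervals, as we would get them from a BAM file
reads = [("chr1", 150, 250), ("chr1", 350, 380), ("chr1", 1100, 1150), ("chr1", 5000, 5100)]

def overlaps(read, exon):
    """True if the read and exon are on the same chromosome and their intervals overlap."""
    r_chrom, r_start, r_end = read
    e_chrom, e_start, e_end = exon
    return r_chrom == e_chrom and r_start <= e_end and e_start <= r_end

# For each read, find a gene whose exons it overlaps and add one to that gene's count
counts = {gene: 0 for gene in gene_exons}
for read in reads:
    for gene, exons in gene_exons.items():
        if any(overlaps(read, exon) for exon in exons):
            counts[gene] += 1
            break  # assign each read to at most one gene

print(counts)  # {'geneA': 2, 'geneB': 1}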

Our Tool of Choice: featureCounts

There are many tools for this job, but we'll use featureCounts. It is extremely fast, efficient, and one of the most widely cited tools for read counting. It takes our BAM files and a GTF file and produces a clean, simple matrix of counts.

(Note: You may also hear about other tools like Salmon or Kallisto. These are fantastic, incredibly fast "pseudo-alignment" tools that quantify expression by mapping reads directly to a transcriptome. For a classic, alignment-based workflow like ours, featureCounts is the standard and robust choice that follows naturally from STAR alignment.)

Inputs and Outputs

  • Inputs:

    1. The BAM files for all of our samples.

    2. The gene annotation GTF file.

  • Output:

    1. A single text file: the counts matrix.

This matrix is the cornerstone of our downstream analysis. It's structured as a simple table:

  • Each row represents a gene.

  • Each column represents a sample.

  • The value in each cell is the raw number of reads from that sample that were assigned to that gene.
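It helps to know what the raw featureCounts output actually looks like before handing it to a statistics package: the file starts with a comment line recording the command that was run, followed by annotation columns (Geneid, Chr, Start, End, Strand, Length) and then one count column per sample, and a companion .summary file is written alongside it. Here is a minimal pandas sketch, assuming the output path used in the Snakemake rule below, that trims the table down to a pure gene-by-sample matrix:

# Quick inspection of the counts matrix with pandas
import pandas as pd

# Skip the leading '#' comment line that featureCounts writes before the header
counts = pd.read_csv("results/quantification/counts.tsv", sep="\t", comment="#")

# Drop the annotation columns so that only gene IDs and per-sample counts remain
# (the sample columns are named after the input BAM file paths)
counts = counts.set_index("Geneid").drop(columns=["Chr", "Start", "End", "Strand", "Length"])

print(counts.head())  # rows = genes, columns = samples, values = raw read counts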

The Snakemake Connection

Of course, we will automate this with a Snakemake rule. This rule is a great example of an "aggregation" step, where we take the results from many samples and combine them into a single summary file.

# In our Snakefile

# Rule to count reads in features (genes) using featureCounts
rule featurecounts:
    input:
        # This rule needs the BAM files for all of our samples
        bams = expand(rules.alignment.output.bam, sample=config["samples"]),
        # It also needs the annotation file
        gtf = config["reference"]["annotation_gtf"]
    output:
        # The final output is a single matrix of counts
        "results/quantification/counts.tsv"
    log:
        "logs/featurecounts.log"
    threads: 8
    shell:
        "featureCounts -p -t exon -g gene_id "  # -p for paired-end, count exons, group by gene_id
        "-a {input.gtf} "                      # The annotation file
        "-o {output} "                         # The output file
        "-T {threads} "                        # Number of threads
        "{input.bams}"                         # The list of all input BAM files
        "&> {log}"                             # Redirect logging

This rule takes all the individual sample alignments and neatly summarizes them into the final counts matrix, which is exactly what we need for the last step.

Next Up

We've finally arrived. We have a clean, quantified matrix of gene expression. Now we can ask the ultimate question: "Which genes show a statistically significant change in expression between my experimental conditions?" In our final post of this core series, we'll explore Differential Expression Analysis.

