We've meticulously cleaned our raw data. Now, we need to figure out where each read came from by aligning it to a reference genome. For RNA-Seq, this is tricky because of splicing, where introns are removed from the final transcript. This means we need a special splice-aware aligner.

Our tool of choice is STAR (Spliced Transcripts Alignment to a Reference), prized for its speed and accuracy.

The Inputs and Outputs of Alignment

Inputs: A reference genome (FASTA), a gene annotation (GTF), and our cleaned FASTQ files.
Main Output: A BAM (Binary Alignment Map) file, which records where each read landed on the genome.

The Snakemake Connection: Building the Workflow

Let's see how the QC and alignment steps connect in a single, powerful Snakefile. This example shows how Snakemake uses input and output to chain rules together, from raw reads to a final summary report.

Assume we have a config.yaml file that defines our samples, genome path, etc.

# Snakefile

# First, load our project's configuration
from pathlib import Path

configfile: "config.yaml"

# The 'rule all' is the final target we want to build.
# By asking for the multiqc_report, Snakemake will automatically
# figure out all the intermediate steps needed to create it.
rule all:
    input:
        "results/multiqc/multiqc_report.html"

# --- PREPROCESSING ---
def get_reads(wildcards):
    """ Simple helper function to return reads with sample and read wildcards"""
    return config["samples"][wildcards.sample]["fastqs"][wildcards.read]

# This rule defines how to merge reads for any sample and read pair (R1/R2)
rule merge_reads:
    input:
        reads = get_reads,
    output:
        # The file we want to create, e.g., "data/merged/sampleB_R1.fastq.gz"
        merged = "results/merged/{sample}_{read}.fastq.gz"
    params:
        # Consistent parent directory creation even if you decide
        # to change output path later on.
        outdir = lambda wcs, output: Path(output.merged).parent
    run:
        print(f"Detected reads: {input.reads}")
        shell(f"mkdir -p {params.outdir}")
        if len(input.reads) > 1:
            print(f"Executing: cat {input.reads} > {output.merged} ")
            shell(f"cat {input.reads} > {output.merged} ")
        else:
            print(f"Executing: ln -sf {input.reads} {output.merged} ")
            shell(f"ln -sf {input.reads} {output.merged} ")

# --- QUALITY CONTROL ---
# Rule to run fastp for adapter and quality trimming
rule fastp:
    input:
        r1 = lambda wcs: expand(rules.merge_reads.output.merged, sample=wcs.sample, read="R1"),
        r2 = lambda wcs: expand(rules.merge_reads.output.merged, sample=wcs.sample, read="R2"),
    output:
        r1 = "results/fastp/{sample}_R1.fastq.gz",
        r2 = "results/fastp/{sample}_R2.fastq.gz",
        html = "results/fastp/{sample}.html"
    shell:
        "fastp --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2} --html {output.html}"

# Rule to run FastQC on the cleaned reads for a final check
rule fastqc:
    input:
        r1 = rules.fastp.output.r1,
        r2 = rules.fastp.output.r2
    output:
        "results/fastqc/{sample}_R1_fastqc.zip",
        "results/fastqc/{sample}_R2_fastqc.zip"
    shell:
        "fastqc -o results/fastqc {input.r1} {input.r2}"


# --- ALIGNMENT ---

# Rule to build the STAR genome index. This is a one-time setup.
rule star_index:
    input:
        genome = config["reference"]["genome_fasta"],
        gtf = config["reference"]["annotation_gtf"]
    output:
        # STAR creates a directory with many files, so we mark it as such
        directory("results/star_index")
    threads: 8
    shell:
        "STAR --runMode genomeGenerate --genomeDir {output} "
        "--genomeFastaFiles {input.genome} --sjdbGTFfile {input.gtf} --runThreadN {threads}"

# Rule to align the cleaned reads of each sample to the indexed genome
rule star_align:
    input:
        r1 = rules.fastp.output.r1,
        r2 = rules.fastp.output.r2
        # This rule depends on the index being created first
        index = rules.star_index.output
    output:
        # The main output is a sorted BAM file
        bam = "results/alignment/{sample}.bam"
    log:
        "logs/star/{sample}.log"
    threads: 8
    shell:
        "STAR --runMode alignReads --genomeDir {input.index} --runThreadN {threads} "
        "--readFilesIn {input.r1} {input.r2} --readFilesCommand zcat "
        "--outSAMtype BAM SortedByCoordinate --outFileNamePrefix results/alignment/{wildcards.sample}_"


# --- AGGREGATION ---

# Rule to run MultiQC and aggregate all our logs and reports
rule multiqc:
    input:
        # Gather all the fastp reports
        fastp = expand("results/fastp/{sample}.html", sample=config["samples"]),
        # Gather all the FastQC reports
        fastqc = expand("results/fastqc/{sample}_{read}_fastqc.zip", sample=config["samples"], read=["R1", "R2"]),
        # Gather all the STAR alignment logs
        star_logs = expand("logs/star/{sample}.log", sample=config["samples"])
    output:
        "results/multiqc/multiqc_report.html"
    shell:
        "multiqc --force -o results/multiqc {input}"

This Snakefile provides a complete, automated workflow. If you ask Snakemake to generate the final MultiQC report, it knows it needs the alignment logs and QC files. To get the alignment logs, it needs to run the alignment, which in turn needs the cleaned FASTQ files from fastp. Snakemake builds this entire dependency graph and executes the steps in the correct order, with the correct parameters, every single time.

Next Up

We've now mapped our reads to genomic locations. The next logical step is to count how many reads fall into each gene. This process, called Quantification, is the subject of our next post and is the final step before we can look for interesting biological changes.

Finding a Home: Aligning RNA-Seq Reads to the Genome

The Inputs and Outputs of Alignment

The Snakemake Connection: Building the Workflow

Next Up

Subscribe to my newsletter

Analog Dreamer

Analog Dreamer