Deconstructing the Data: A Guide to Raw RNA-Seq Files


In our last post, we made the case for using workflow managers. Now, let's apply that philosophy to a real-world scenario: analyzing RNA-sequencing (RNA-Seq) data. This process, which takes us from raw sequencer output to biological insight, will be our case study for building a powerful Snakemake workflow.

This post focuses on the very first, and most fundamental, step: understanding your raw data files.

The Starting Point: FASTQ Files

Your journey begins with FASTQ files, the raw output of the sequencing machine. Each file contains millions of short "reads": snippets of the RNA sequences in your biological sample. For each read, the FASTQ file stores the sequence of nucleotide bases (A, C, G, T) and a per-base quality score that reflects how confident the sequencer was in each call.
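
Here is what a single record looks like (this read is made up purely for illustration). Line 1 is the read identifier, line 2 the base calls, line 3 a separator (just a +), and line 4 one quality score per base, encoded as ASCII characters on the Phred scale.

@read_001
ACGTACGTTAGGCATCGATT
+
IIIIIIIIIIFFFFFFFFFF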

Paired-End Reads: Why Are There Two Files?

You'll almost always get two FASTQ files per sample, typically named with _R1 and _R2 suffixes. This is because of paired-end sequencing, where the machine reads a short sequence from both ends of the same RNA fragment. The R1 file contains the "forward" reads, and the R2 file contains their corresponding "reverse" partners. Knowing that two reads come from the same fragment constrains where they can map, which makes alignment to the genome considerably more accurate.
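
A quick way to convince yourself that two files really are mates (the paths below are placeholders): reads appear in the same order in R1 and R2, and the identifier before the space in each header typically matches between the pair, with only the mate number after the space differing.

zcat path/to/sampleA_lane1_r1.fastq.gz | head -1
zcat path/to/sampleA_lane1_r2.fastq.gz | head -1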

Technical vs. Biological Replicates: To Merge or Not to Merge?

What if you receive multiple pairs of FASTQ files for a single sample (e.g., from different sequencing lanes)? These are technical replicates, and they should be merged: combine all the R1 files into one file and all the R2 files into another to get a complete picture of the sample. Biological replicates, by contrast, come from independent samples and must not be merged; they stay separate so that downstream tools can estimate biological variability.
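
Done by hand, the merge is just one cat per read direction (the paths are the same illustrative ones used in the dummy config below). This is safe for .gz files because concatenated gzip streams are themselves a valid gzip stream:

cat path/to/sampleA_lane1_r1.fastq.gz path/to/sampleA_lane2_r1.fastq.gz > sampleA_R1.fastq.gz
cat path/to/sampleA_lane1_r2.fastq.gz path/to/sampleA_lane2_r2.fastq.gz > sampleA_R2.fastq.gz

Multiply that by dozens of samples and it becomes easy to mix up a lane or a read direction, which is exactly the kind of bookkeeping we want to hand over to Snakemake.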

Let's Make it Real: A Snakemake Merge Rule

Instead of manually concatenating files, which is tedious and error-prone, let's define a Snakemake rule to do it for us. This example shows how Snakemake can automatically handle this task for any sample you define.

1. Create a Dummy Config File

First, create a file named dummy_config.yaml. It tells Snakemake which samples need merging and which lane-level FASTQ files belong to each one. The sample names deliberately avoid underscores, and the read keys are uppercase R1/R2, so that they line up exactly with the output filenames (e.g., sampleB_R1.fastq.gz) from which the merge rule below extracts its wildcards.

# dummy_config.yaml
samples:
  sampleA:
    fastqs:
      R1: ["path/to/sampleA_lane1_r1.fastq.gz", "path/to/sampleA_lane2_r1.fastq.gz"]
      R2: ["path/to/sampleA_lane1_r2.fastq.gz", "path/to/sampleA_lane2_r2.fastq.gz"]

  sampleB:
    fastqs:
      R1: ["path/to/sampleB_lane1_r1.fastq.gz", "path/to/sampleB_lane2_r1.fastq.gz"]
      R2: ["path/to/sampleB_lane1_r2.fastq.gz", "path/to/sampleB_lane2_r2.fastq.gz"]

2. Create the Snakefile

Next, create a file named Snakefile. This contains the rule that tells Snakemake how to perform the merge.

# Snakefile

# First, tell Snakemake to load our configuration
configfile: "dummy_config.yaml"

def get_reads(wildcards):
    """ Simple helper function to return reads with sample and read wildcards"""
    return config["samples"][wildcards.sample]["fastqs"][wildcards.read]

# This rule defines how to merge reads for any sample and read pair (R1/R2)
rule merge_reads:
    input:
        reads = get_reads,
    output:
        # The file we want to create, e.g., "data/merged/sampleB_R1.fastq.gz"
        merged = "data/merged/{sample}_{read}.fastq.gz"
    shell:
        # The command to run. Snakemake automatically fills in {input} and {output}.
        "cat {input.reads} > {output.merged}"

3. Execute the Workflow

Now, from your terminal, ask Snakemake to create one of the target files. Snakemake will figure out the rest.

Snakemake creates missing output directories on its own, but if you'd like data/merged to exist up front, you can make it yourself:

mkdir -p data/merged

Now, run Snakemake:

snakemake --snakefile Snakefile --configfile dummy_config.yaml data/merged/sampleB_R1.fastq.gz --cores 1

Snakemake will read the Snakefile, see that data/merged/sampleB_R1.fastq.gz can be created by the merge_reads rule, look up the correct input files in dummy_config.yaml, and run the cat command to produce your merged file (provided the input FASTQ paths in the config point to files that actually exist on disk).
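
Before running for real, you can ask for a dry run: the -n flag makes Snakemake report the jobs it would run without executing anything, and -p additionally prints the shell commands.

snakemake --snakefile Snakefile --configfile dummy_config.yaml data/merged/sampleB_R1.fastq.gz --cores 1 -n -p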

This is the power of Snakemake: you declare the result you want, and it executes the necessary steps.
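
A natural next step, not shown above, is a rule all that requests every merged file at once, so that a plain snakemake --cores 1 builds them all. Here is a minimal sketch, assuming the config layout from dummy_config.yaml; it goes near the top of the Snakefile (after the configfile: line), since Snakemake treats the first rule as the default target.

rule all:
    input:
        expand(
            "data/merged/{sample}_{read}.fastq.gz",
            sample=config["samples"],  # expands over the sample names: sampleA, sampleB
            read=["R1", "R2"],
        )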

Next Up

Now that we understand our raw data and have seen how to handle it, we must ensure it's high quality. In the next post, we'll dive into Quality Control (QC) with tools like FastQC and fastp, a critical step for any robust analysis.

