Beginner's Guide to Gene Expression Prediction

Introduction

Imagine you could predict how a cell behaves when you tweak its genes—without ever stepping into a lab! That’s the exciting world of the Virtual Cell Challenge, a 2025 competition by the Arc Institute. It’s like a game where your AI model acts as a "cell fortune teller," guessing how silencing a gene changes a cell’s activity. This blog will break down the challenge, focusing on the validation data, with simple examples to help you get started—even if you’re new to biology or AI!

What’s the Virtual Cell Challenge?

Every cell carries thousands of genes, and which ones are switched on shapes how the cell behaves. Scientists can tweak these genes using tools like CRISPR to study diseases or drugs, but each experiment takes time and money. The challenge asks us to build AI models that simulate these changes virtually, saving effort and speeding up discoveries.

Goal: Predict how silencing a gene affects all other genes in a stem cell (H1 embryonic stem cell line).
Data: We get training data (known outcomes) and a validation set (new puzzles to solve).
Prize: A leaderboard to see whose model is the best "fortune teller"!

The Training Data: `adata_training.h5ad`

This is your primary dataset for training your model.

What it is: A large file (15 GB) in the AnnData (H5AD) format, which is standard for single-cell data.
Contents: Gene expression data for 221,273 cells across 18,080 genes, stored in the .X matrix.
Key Information: Metadata
- obs (Observations/Cells): For each cell, you have:
  - target_gene: The gene that was perturbed (or 'non-targeting' for control cells).
  - guide_id: The specific guide RNA used for the perturbation.
  - batch: Information about the experimental batch.
- var (Variables/Genes): For each gene, you have its gene_id
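
A minimal sketch of how you might open the file and count cells per perturbation. It assumes the `anndata` package is installed and that the file sits in the working directory; the helper names `load_training` and `summarize_targets` are made up for this post. `read_h5ad` with `backed="r"` keeps the large matrix on disk, which helps with a file of this size.

```python
from collections import Counter


def summarize_targets(target_genes):
    """Count cells per perturbation and report how many are controls."""
    counts = Counter(target_genes)
    n_controls = counts.get("non-targeting", 0)
    return counts, n_controls


def load_training(path="adata_training.h5ad"):
    """Open the training file without reading the whole matrix into memory."""
    import anndata as ad  # imported lazily; needs `pip install anndata`

    # backed="r" leaves .X on disk and loads only the metadata tables.
    return ad.read_h5ad(path, backed="r")


# Typical use once the file is downloaded (the path is an assumption):
# adata = load_training("adata_training.h5ad")
# print(adata.shape)            # (number of cells, number of genes)
# print(adata.obs.columns)      # should list target_gene, guide_id, batch
# counts, n_controls = summarize_targets(adata.obs["target_gene"])
```

The counting helper works on any list of labels, so you can try it on a handful of made-up values before touching the real file.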

1. The Core Data Matrix (`.X`) 📊

This is the heart of the dataset. It's a massive table with 221,273 rows (cells) and 18,080 columns (genes). Each number in this table represents the expression level of a specific gene in a single cell. A higher number means the gene is more active.

Many people assume their scRNA-seq matrix holds raw counts.
It doesn't.

1.1 What is a UMI?

UMI = Unique Molecular Identifier
It's a short random barcode (e.g., 8–12 nucleotides) added to each RNA molecule during library preparation in single-cell RNA sequencing (scRNA-seq).

If GeneA has 50 UMIs in Cell1, approximately 50 RNA molecules of GeneA were detected in that cell.
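
To make the idea concrete, here is a small sketch of UMI counting: reads that share the same cell, gene, and UMI are copies of one original molecule, so they are counted once. The reads and barcodes below are invented for illustration, and real pipelines also handle sequencing errors in the UMI, which this sketch ignores.

```python
from collections import defaultdict

# Each sequenced read: (cell, gene, UMI). All values are made up.
reads = [
    ("Cell1", "GeneA", "AACGTTAG"),
    ("Cell1", "GeneA", "AACGTTAG"),  # duplicate copy of the molecule above
    ("Cell1", "GeneA", "CCGTAATC"),
    ("Cell1", "GeneB", "TTGACCGA"),
]


def count_umis(reads):
    """Return the number of distinct UMIs seen for each (cell, gene) pair."""
    unique = defaultdict(set)
    for cell, gene, umi in reads:
        unique[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in unique.items()}


umi_counts = count_umis(reads)
print(umi_counts)
```

Here GeneA ends up with 2 UMIs even though 3 reads mapped to it, because two of the reads carried the same barcode.
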
Easy so far. Now we get to the core.

Raw UMI Counts (Starting Point)

Gene	Cell 1	Cell 2
G1	200	500
G2	50	150
G3	0	300

1.2. Normalization (`sc.pp.normalize_total`)

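Goal:
Make all cells comparable by scaling them to the same total number of counts (UMIs).
For each cell, calculate a scaling factor by dividing the target total by the cell’s own total, then multiply every gene’s count by that factor.

Example (Cell 1: total = 200 + 50 + 0 = 250, target = 10,000):
Scale factor = 10,000 ÷ 250 = 40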
- G1: 200 × 40 = 8,000
- G2: 50 × 40 = 2,000
- G3: 0 × 40 = 0
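
The arithmetic above can be reproduced in a few lines of plain Python. This is a sketch of the idea only, not Scanpy's implementation, and the function name `scale_to_target` is made up for this post. In Scanpy itself the equivalent call would be `sc.pp.normalize_total(adata, target_sum=1e4)`.

```python
def scale_to_target(counts, target_sum=10_000):
    """Scale one cell's counts so that they add up to target_sum."""
    total = sum(counts.values())
    if total == 0:
        # An empty cell cannot be scaled; return it unchanged.
        return dict(counts)
    scale = target_sum / total
    return {gene: value * scale for gene, value in counts.items()}


cell_1 = {"G1": 200, "G2": 50, "G3": 0}  # raw UMI counts from the table
normalized = scale_to_target(cell_1)
print(normalized)
```

Running the same function on Cell 2 gives a different scale factor, because its total (950) is different, yet both cells end up summing to 10,000.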

1.3. Log1p Transformation (`sc.pp.log1p`)

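Goal:
Compress the range so very high counts don’t dominate analyses.
What it does:
Replaces each value x with log(1 + x), using the natural logarithm. Adding 1 keeps zeros at zero, since log(1) = 0.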

Example:
- G1: log1p(8,000) ≈ 8.99
- G2: log1p(2,000) ≈ 7.60
- G3: log1p(0) = 0
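
You can check these numbers with Python's built-in `math.log1p`, which computes the natural logarithm of one plus its argument. This is a sketch on the three example values only; on a real AnnData object you would call `sc.pp.log1p(adata)` instead.

```python
import math

# Normalized values for Cell 1 from the previous step.
normalized = {"G1": 8000.0, "G2": 2000.0, "G3": 0.0}

# log1p(x) = ln(1 + x), so a zero count stays exactly zero.
logged = {gene: math.log1p(value) for gene, value in normalized.items()}

for gene, value in logged.items():
    print(f"{gene}: {value:.2f}")
```

If you ever need to go back to the normalized scale, `math.expm1` is the inverse of `math.log1p`.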

Now we can describe what the values stored in the .X matrix mean:

0 = no detected RNA for that gene in that cell
Higher values = more RNA molecules (after scaling and log transformation)
The values are not raw counts but a measure of expression that is comparable across cells.

2. Cell Metadata (`.obs`) 🏷️

This part of the data provides labels and context for each of the 221,273 cells (the rows).

Cell index (cell barcode plus batch index)
- Cell Barcode (e.g., AAACAAGCAACCTTGT): A short DNA sequence attached to the RNA from one cell, so every read can be traced back to the cell it came from.
- Batch Index (e.g., Flex_1_01): This tells you which experimental group, or "batch," the cell was processed in.
target_gene 🎯
- What it does: This tells you which gene was intentionally disrupted or "knocked out" in that specific cell.
- Special Case: 'non-targeting': This value identifies the 38,176 control cells. These are unperturbed cells that serve as a vital baseline. Your model needs to see what a "non-targeting" cell looks like to learn what "normal" is. Only then can it measure how much a cell changes when a gene is actually targeted.
guide_id
- What it does: This is a more specific identifier for the molecular tool (the guide RNA) used to target the gene. Sometimes, multiple different guides are used to target the same gene to ensure the effect is real.
batch 🧪
- What it does: This identifies the experimental batch the cell was processed in. Experiments of this scale are often done in multiple groups or "batches" over time.
- Why it's here: To correct for batch effects. Cells from different batches can differ slightly because of lab conditions rather than biology. This label lets your model recognize and discount these technical variations and focus on the true biological changes caused by the gene perturbation.
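
The role of the 'non-targeting' label described above can be sketched with a toy example: average the control cells, average the cells for one perturbation, and subtract. The four cells and their values are invented, and the helper name `mean_profile_by_target` is made up for this post. Batch correction is left out to keep the sketch short.

```python
from collections import defaultdict

# Toy cells: (target_gene label, expression values for genes A, B, C).
cells = [
    ("non-targeting", [10.0, 20.0, 30.0]),
    ("non-targeting", [12.0, 18.0, 30.0]),
    ("SH3BP4", [8.0, 22.0, 30.0]),
    ("SH3BP4", [9.0, 21.0, 30.0]),
]


def mean_profile_by_target(cells):
    """Average the expression of every gene within each target_gene group."""
    groups = defaultdict(list)
    for target, values in cells:
        groups[target].append(values)
    return {
        target: [sum(column) / len(rows) for column in zip(*rows)]
        for target, rows in groups.items()
    }


profiles = mean_profile_by_target(cells)
control = profiles["non-targeting"]

# Perturbation effect = perturbed mean minus control mean, gene by gene.
effect = [p - c for p, c in zip(profiles["SH3BP4"], control)]
print(effect)
```

A negative number means the gene is less active than in control cells, a positive number means it is more active, and zero means no change.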

3. Gene Metadata (`.var`) 🧬

This provides information about each of the 18,080 genes (the columns).

gene_name (e.g., SAMD11)
- What it does: These are the common, human-readable symbols for the genes.
gene_id (e.g., ENSG00000187634)
- What it does: This is a unique and stable identifier from the Ensembl database.
- Why it's here: Gene symbols can sometimes have aliases or change over time. The Ensembl ID is a permanent identifier that prevents any ambiguity, ensuring your model always knows exactly which gene is being referred to.

The Validation Data

The validation set is where you test your skills with 50 new genes. The file pert_counts.Validation.csv (1 KB, 50 rows) is your guide. Let’s look at an example row:

target_gene: SH3BP4
n_cells: 2,925
median_umi_per_cell: 54,551

This means: Predict how silencing SH3BP4 changes all 18,080 genes across 2,925 virtual cells, where a typical cell has a total of 54,551 UMIs (the median across cells).
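
Here is a small sketch of reading such a file with Python's built-in `csv` module. It parses an in-memory copy of the example row so that it runs without the download, and it assumes the column names match the three fields shown above. The function name `read_pert_counts` is made up for this post.

```python
import csv
import io

# In-memory stand-in for pert_counts.Validation.csv (one example row).
csv_text = "target_gene,n_cells,median_umi_per_cell\nSH3BP4,2925,54551\n"


def read_pert_counts(handle):
    """Parse the validation table into a list of typed dictionaries."""
    tasks = []
    for row in csv.DictReader(handle):
        tasks.append(
            {
                "target_gene": row["target_gene"],
                "n_cells": int(row["n_cells"]),
                "median_umi_per_cell": float(row["median_umi_per_cell"]),
            }
        )
    return tasks


tasks = read_pert_counts(io.StringIO(csv_text))
print(tasks[0])

# With the real file you would pass an open file handle instead:
# with open("pert_counts.Validation.csv", newline="") as handle:
#     tasks = read_pert_counts(handle)
```

Each dictionary then tells you which gene to simulate, how many virtual cells to produce, and which median total to aim for.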

The Prediction Task: Cooking Your Virtual Cells

Your job is to create a matrix for SH3BP4 in the same AnnData format as the training data:

Rows: 2,925 virtual cells (like 2,925 plates of food).
Columns: 18,080 genes (ingredients in each plate).
Numbers: Expression levels (how much of each ingredient is on the plate).

Example with 3 Genes

Let’s simplify to 3 genes (A, B, C) and 3 cells:

Control Baseline: Control (non-perturbed) cells have average expression: Gene A = 10,000, Gene B = 20,000, Gene C = 24,551 (total = 54,551 per cell).
Training Insight: Suppose the training data shows that silencing SH3BP4 reduces Gene A by 20%, increases Gene B by 10%, and leaves Gene C unchanged.

Predictions:
- Cell 1: Gene A = 8,000, Gene B = 22,000, Gene C = 24,551 (Total = 54,551)
- Cell 2: Gene A = 8,400, Gene B = 20,900, Gene C = 25,042 (Total = 54,342)
- Cell 3: Gene A = 7,200, Gene B = 24,200, Gene C = 24,060 (Total = 55,460)

Generate Predictions for 2,925 Virtual Cells

Let’s predict expression for 3 sample cells out of the 2,925. We’ll add some variation to mimic real cell differences:

Cell 1:
- Gene A: 10,000 × (1 - 0.20) = 8,000
- Gene B: 20,000 × (1 + 0.10) = 22,000
- Gene C: 24,551 × (1 + 0.00) = 24,551
- Total: 8,000 + 22,000 + 24,551 = 54,551
Cell 2 (slight variation, up to ±5% per gene):
- Gene A: 8,000 × 1.05 = 8,400
- Gene B: 22,000 × 0.95 = 20,900
- Gene C: 24,551 × 1.02 = 25,042
- Total: 8,400 + 20,900 + 25,042 = 54,342
Cell 3 (more variation, up to ±10% per gene):
- Gene A: 8,000 × 0.90 = 7,200
- Gene B: 22,000 × 1.10 = 24,200
- Gene C: 24,551 × 0.98 = 24,060
- Total: 7,200 + 24,200 + 24,060 = 55,460

This tiny matrix looks like:

Cell	Gene A	Gene B	Gene C	Total
Cell 1	8,000	22,000	24,551	54,551
Cell 2	8,400	20,900	25,042	54,342
Cell 3	7,200	24,200	24,060	55,460
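
The recipe behind this table can be sketched in plain Python: start from the control average, apply the perturbation effect, then add a little random variation per gene. The effect sizes are the illustrative ones from this example, while the uniform ±5% noise, the fixed seed, and the function name `predict_cells` are assumptions made for this sketch. A real model would learn both the effect and the spread from the training data.

```python
import random

control_mean = {"Gene A": 10_000, "Gene B": 20_000, "Gene C": 24_551}
effect = {"Gene A": -0.20, "Gene B": 0.10, "Gene C": 0.00}  # illustrative


def predict_cells(control_mean, effect, n_cells, noise=0.05, seed=0):
    """Generate virtual cells: control mean x (1 + effect) x random jitter."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    cells = []
    for _ in range(n_cells):
        cell = {}
        for gene, base in control_mean.items():
            shifted = base * (1 + effect[gene])
            jitter = 1 + rng.uniform(-noise, noise)
            cell[gene] = shifted * jitter
        cells.append(cell)
    return cells


# With no noise, every cell matches Cell 1 in the table.
noiseless = predict_cells(control_mean, effect, n_cells=1, noise=0.0)[0]

# With noise, each of the 2,925 cells gets its own slightly different total.
virtual_cells = predict_cells(control_mean, effect, n_cells=2_925)
print(len(virtual_cells), sum(virtual_cells[0].values()))
```

Because every gene is jittered by at most 5%, each cell's total stays within 5% of 54,551.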

For the full 2,925 cells, you’d generate 2,925 rows like this, with varying totals.

Check the Median

List the totals for all 2,925 cells (e.g., 54,551, 54,342, 55,460, …).
Sort them and find the middle value (median). Let’s say that, after generating all the cells, the median is 53,000 (hypothetical).
Since 53,000 is below the target of 54,551, scale every cell up: multiply each cell’s expression values by 54,551 / 53,000 ≈ 1.03.

The adjusted totals for the three sample cells become:

Cell 1: 54,551 × 1.03 ≈ 56,187
Cell 2: 54,342 × 1.03 ≈ 55,972
Cell 3: 55,460 × 1.03 ≈ 57,124
These three cells sit above the median, but across all 2,925 cells the median moves from 53,000 to about 53,000 × 1.03 ≈ 54,590, close to the 54,551 target.
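
The median check can be written as a short function using Python's built-in `statistics.median`. The three toy cells below are invented so that their median total is 53,000, matching the hypothetical case above, and the function name `rescale_to_median` is made up for this post.

```python
import statistics


def rescale_to_median(cells, target_median):
    """Multiply every value by one factor so the median cell total hits the target."""
    totals = [sum(cell) for cell in cells]
    current_median = statistics.median(totals)
    factor = target_median / current_median
    scaled = [[value * factor for value in cell] for cell in cells]
    return scaled, factor


# Toy cells with totals 52,000, 53,000 and 54,000 (median = 53,000).
cells = [
    [7_600, 21_000, 23_400],
    [7_800, 21_500, 23_700],
    [8_000, 22_000, 24_000],
]

scaled, factor = rescale_to_median(cells, target_median=54_551)
new_median = statistics.median(sum(cell) for cell in scaled)
print(round(factor, 4), round(new_median))
```

Using the exact factor instead of the rounded 1.03 lands the median on the target, and because every cell is multiplied by the same number, the relative differences between cells and genes are preserved.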

Why This Matters

Science Boost: Accurate predictions can simulate experiments, helping find cures faster.
Fun Challenge: Compete on a leaderboard, refining your model with feedback.
Real Impact: It’s a step toward AI understanding life’s building blocks!

Tips for Beginners

Start Simple: Use control cell averages as a baseline, then tweak based on SH3BP4’s effects.
Tools: Use Python with libraries like Scanpy to handle AnnData files.
Practice: Generate a small matrix, check the median, and adjust.

Final Words

Thanks for taking the time to read!

Self-learning brings its own unique style of building and growing.
If you’re like me, reach out and let’s connect: LinkedIn, X

Follow my journey via #100DaysofAIEngineer on X to see my daily work and progress.

Keep building! Get your hands dirty!

Conclusion

The Virtual Cell Challenge is like a cooking contest where your AI predicts new recipes. The validation data (e.g., SH3BP4 with 2,925 cells and a median of 54,551 UMIs per cell) guides you to make realistic virtual cells. Start small, learn from the training data, and have fun predicting! If you’re stuck, experiment and ask for help. Science loves curiosity!

Ready to try? Grab the data and code your first prediction!
