Lab Notes: Custom GenePT embeddings


In the spirit of “Learning in Public” and “learning exhaust,” I’m adding “Lab Notes” blog posts that chronicle little discoveries or failures I have along the way to a larger goal. This is a follow-up to my previous Lab Notes: Comparing GenePT and scGPT post, this time detailing how to customize the GenePT embeddings to a task, with a quick comparison to the previous results.
I’m collaborating with Eugene Bolotin from miraomics.bio to understand the AI landscape for Single Cell and Spatial Transcriptomics. If you don’t know what that means, we’re going to do a series of blog posts soon that give enough detail to get the AI and Bio people on the same page.
A meh result
Ok, above I said “little discoveries or failures,” and this one qualifies as a “meh” success, but I learned from it, and so can you. The TL;DR is that the specialized embeddings I created for classification seem to help… a bit. We can still play with the text embeddings to figure out whether a better prompt might help more.
With that said, here’s how you can repeat the experiment and fiddle:
Leverage the giant LLMs!
The main idea of GenePT is that we can use the knowledge about the world that is baked into the OpenAI embedding models to embed genes in a meaningful semantic space, which lets us perform downstream tasks like classification, clustering, perturbation analysis, and the more complex tasks that embeddings enable. It does this by taking text describing the gene and passing that text through the embedding model. Please see my previous post for more details on how the model actually works.
The original GenePT model uses the NCBI description of each gene as the text, but we use a more elaborate description that combines the NCBI description with other information about each gene from OpenAI’s gpt-4o-mini. So we are now using two models: gpt-4o-mini to generate a gene description based on its world knowledge (academic papers, databases, etc.), and the OpenAI embedding model to turn that description into a fixed-length vector.
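To make the two-step idea concrete, here’s a minimal sketch. The embedding model name and the expression-weighted averaging (GenePT-w style, per the GenePT paper) are assumptions; see the repo for the exact pipeline:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_descriptions(descriptions: dict[str, str]) -> dict[str, np.ndarray]:
    """Map each gene symbol to the embedding of its text description."""
    genes = list(descriptions)
    response = client.embeddings.create(
        model="text-embedding-3-large",  # assumed; the repo pins the exact model
        input=[descriptions[g] for g in genes],
    )
    return {g: np.array(item.embedding) for g, item in zip(genes, response.data)}

def embed_cell(counts: dict[str, float], gene_vecs: dict[str, np.ndarray]) -> np.ndarray:
    """Cell embedding = expression-weighted average of gene embeddings, L2-normalized."""
    total = sum(counts[g] * gene_vecs[g] for g in counts if g in gene_vecs)
    return total / np.linalg.norm(total)
```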
This is a pretty cool idea for a few reasons:
It works! See the GenePT paper for analysis
It’s fast! On my MacBook Pro with an M3 processor and 16 GB of RAM, using a single thread I can embed ~425 cells per second, compared to ~350 cells per second using an NVIDIA A10 GPU with scGPT. Stay tuned for performance optimizations that will let us embed 100 million cells in a reasonable amount of time on a CPU.
We can avoid training (and re-training) sophisticated models like scGPT, Geneformer, etc., since the hard work of training the model has already been done by OpenAI
Most important for this post: we can tailor the gene descriptions to the downstream task. That’s the idea behind this experiment.
Later in this post, I’ll discuss some other ideas we have for how we might use the model in other interesting ways.
Your very own gene descriptions
Ok, so how do we go about creating custom gene descriptions? Let’s continue focusing on cell type classification. Our previous embeddings didn’t ask the model to include any cell-type-related information in the gene descriptions, so let’s try adding that. You can follow along at generate_genept_embeddings.ipynb.
Here’s our prompt to gpt-4o-mini:
```python
NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_DRUG_PATHWAY_PROMPT_V1 = """Tell me about the {0} gene.
Here is the NCBI and UniProt summary of the gene:
{1}
----
In addition to the provided information, please:
1. List any other genes that the gene is associated with, particularly those not mentioned in the summaries above.
2. List any cell types or cell classes that the gene is expressed in.
3. List any drug or drug classes that are known to interact with this gene.
4. Pathways and biological processes that this gene is involved in.
Only include specific information about the gene or gene class. If information is not well documented, say so briefly and don't expound on general information.
"""
```
gpt-4o-mini tended to add fluff and generic information to the description, which would probably make the embedding less discriminative, so notice the last line:
Only include specific information about the gene or gene class. If information is not well documented, say so briefly and don't expound on general information.
In my experiments with different prompts, this helped to focus the model on information that it actually knows.
Here’s an example generated response for the well-known BRCA1:
1. Associated Genes
BRCA1 is known to interact with several other genes, including but not limited to:
BRCA2: Works in conjunction with BRCA1 in DNA repair processes.
PALB2: Interacts with BRCA1 to facilitate homologous recombination repair.
BARD1: Forms a heterodimer with BRCA1, playing a role in DNA damage response and repair.
CHEK1: Involved in cell cycle regulation and DNA damage response.
FANCD2: Associated with the Fanconi anemia pathway and DNA repair.
2. Cell Types or Cell Classes
BRCA1 is expressed in various cell types, including:
Breast epithelial cells
Ovarian epithelial cells
Other tissues such as prostate, pancreas, and brain cells
Hematopoietic cells (e.g., lymphocytes)
3. Drug or Drug Classes
While specific drugs targeting BRCA1 directly are not well-documented, the following classes of drugs are known to interact with the pathways involving BRCA1:
PARP inhibitors (e.g., Olaparib, Rucaparib): These are used in the treatment of cancers associated with BRCA mutations.
Chemotherapeutic agents: Certain DNA-damaging agents may have altered efficacy in cells with BRCA1 mutations.
4. Pathways and Biological Processes
BRCA1 is involved in several key biological processes and pathways, including:
DNA repair: Particularly in homologous recombination and repair of double-stranded breaks.
Cell cycle regulation: Involvement in G2/M checkpoint control.
Ubiquitination: Acts as an E3 ubiquitin ligase, regulating protein degradation and cellular responses to DNA damage.
Transcription regulation: Interacts with RNA polymerase II and histone deacetylase complexes to influence gene expression.
Centrosome function: Regulates microtubule nucleation at the centrosome.
This information provides a focused overview of BRCA1's associations, expression, interactions, and biological roles.
Ok, GO
We have 37,262 genes to do this for (in our formulation; you could do more or fewer), so just hammering the chat-completion API with all of these requests is likely to be slow and unreliable. It also happens to cost exactly 2x more than if we use OpenAI’s handy batch API, which is designed for exactly this kind of thing. You basically: write one JSONL line per request, upload the file, kick off a batch job, poll until it completes, and download the results.
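Here’s a rough sketch of that flow with the Python SDK; the file name, the placeholder gene_summaries dict, and the polling loop are illustrative, not lifted from the actual notebook:

```python
import json
import time
from openai import OpenAI

client = OpenAI()
PROMPT = NCBI_UNIPROT_ASSOCIATED_CELL_TYPE_DRUG_PATHWAY_PROMPT_V1

# gene_summaries: gene symbol -> combined NCBI/UniProt summary (load your own here)
gene_summaries = {"BRCA1": "..."}

# 1. One JSONL line per gene, each a standalone chat-completion request.
with open("gene_requests.jsonl", "w") as f:
    for gene, summary in gene_summaries.items():
        f.write(json.dumps({
            "custom_id": gene,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": PROMPT.format(gene, summary)}],
            },
        }) + "\n")

# 2. Upload the file and start the batch job.
batch_file = client.files.create(file=open("gene_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# 3. Poll until the batch finishes, then download the results JSONL.
while (batch := client.batches.retrieve(batch.id)).status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
if batch.status == "completed":
    results_jsonl = client.files.content(batch.output_file_id).text
```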
You should probably try with ~10 genes and take a look at the results before kicking off the full job, but note that small batches actually have much slower throughput than large batches. This is because of the way OpenAI squeezes your batch in during lulls in usage.
When I tried 100 genes to start with, they took 48 minutes. Seems pretty slow, but all 37K genes took me 90 minutes, so clearly OpenAI is using up slack resources with some fancy bin-packing.
In any event, we save the descriptions to the HuggingFace dataset honicky/genept-composable-embeddings-source-data to make it easy for anyone to reproduce our results.
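If you just want the descriptions, something like this should fetch them (the split and column layout are whatever the dataset card specifies):

```python
from datasets import load_dataset

# Pull the generated gene descriptions from the HuggingFace Hub.
source_data = load_dataset("honicky/genept-composable-embeddings-source-data")
```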
Ok, go again
Now we need to take all of the outputs and embed them using OpenAI’s embedding model. We can follow the same steps as above, except this time using the embedding model to encode the text into vectors instead of generating textual gene descriptions.
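The only real change is the body of each JSONL request line, which now targets the embeddings endpoint. A sketch (model name assumed, as before):

```python
def embedding_request(gene: str, description: str) -> dict:
    """One JSONL line for the embedding batch job."""
    return {
        "custom_id": gene,
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": "text-embedding-3-large", "input": description},
    }
```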
We take the results and save them to the honicky/genept-composable-embeddings HuggingFace model repo for anyone to use.
How much did it cost?
Using gpt-4o-mini plus the embedding model cost a grand total of $6.83, and I think I ran the embeddings twice, in which case a single run would be around $5.63.
Not bad. How about gpt-4o? (I used a slightly different prompt that also added tissue type, so the results are confounded a bit.) That came to $105.02, or about 19x as expensive. Oof! Since the prompts were also different, it’s not a perfect apples-to-apples comparison, but the order of magnitude is clearly a lot different.
Did it help?
Ok, we have our fancy new embeddings; does adding the cell-type information help them perform better? Let’s just run the tabula_sapiens_analysis_all.ipynb notebook from the previous post again and compare results.
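For reference, the comparison boils down to something like the sketch below; the classifier choice and the train/test variable names are illustrative stand-ins for the notebook’s actual setup:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Fit a simple classifier on one set of cell embeddings...
clf = KNeighborsClassifier(n_neighbors=10).fit(train_embeddings, train_cell_types)
# ...and compare macro / weighted precision, recall, and F1 on held-out cells.
print(classification_report(test_cell_types, clf.predict(test_embeddings)))
```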
This time we’ll just take a look at the macro statistics to see how we do:
Looks like scGPT wins on TSP2 and TSP14, whereas our gpt-4o-based embeddings win for TSP1, and by a lot for the weighted average. This probably means that scGPT does poorly on one of the bigger classes in TSP1. Let’s take a look (green means GenePT is better, red means scGPT is better):
Indeed, GenePT did way better for TSP1 on “glandular epithelial cell”, and there were lots of examples of it in the test set, whereas scGPT did better for “epithelial cells” and “transitional epithelial cells”, but those had relatively small counts in TSP1.
Ultimately, it looks like we do a bit better using our new embeddings than the originals, but we need to explore further using a larger and more diverse dataset. We can start with the whole Tabula Sapiens dataset and scale up to some larger ones.
Some ideas
So what’s next?
Go Big! - we should just rerun the comparison using all of Tabula Sapiens instead of a 100k-cell subset. After that, we have a 100M(!) cell dataset to tackle!
Cross-domain search - since our cells and papers can be embedded in the same space, maybe we can search for cells that match papers and papers that match cells?
Bulk data - GenePT works with gene expression counts, so maybe we can embed bulk expression data (e.g. gene expression from larger tissue samples with lots of cells) and use it for embedding-related tasks?
Better prompts or models - we didn’t really explore better prompts very much; maybe we can improve further by tweaking the prompt to extract better data from gpt-4o, or by using a model that has better gene expression information in its training set?
Different downstream problems - we haven’t explored how our fancy prompts do on other downstream tasks like perturbation analysis, disease classification, etc.
I’d love to hear your thoughts and criticism. Comment here, or find me at
LinkedIn: rj-honicky
BlueSky: honicky.bsky.social
X: honicky