CRISPR Repair Outcome Prediction

Daniel_AdediranDaniel_Adediran
3 min read

Overview

pyTDC is a Python package that stands for Target Discovery Toolkit. It's designed to provide tools and utilities for target discovery in drug discovery and development. The toolkit aims to help researchers and practitioners in the field of drug discovery to access and use various target discovery methods, databases, and related information. Here we are retrieving the dataset from pyTDC for CRISPR repair outcome prediction

Installation

To install pyTDC, use the following command:

!pip install pyTDC

Usage Importing the Package and acessing the target Discovery Method

#To obtain the label names,
from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('Leenay')
#Then, proceed with the usual data loader
from tdc.single_pred import CRISPROutcome
data = CRISPROutcome(name = 'Leenay', label_name = label_list[0])
df = data.get_split()
print(df)

Importing the Datasets using Pandas

import pandas as pd
# Convert the dictionary to a DataFrame
df = pd.DataFrame(df['train'])

# Write the DataFrame to a CSV file
df.to_csv('train.csv', index=False)

print('DataFrame converted and saved to CSV successfully.')
print(df)

Importing Libraries

!pip install xgboost
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# Assuming df is your DataFrame with columns 'GuideSeq_ID', 'GuideSeq', and 'Y'

data = pd.read_csv("/workspace/Crispr-repair-outcome/train.csv")

train_GuideSeq = data['GuideSeq']
train_Y = data['Y']

Encoding

Transform GuideSeq to numerical codes using `LabelEncoder` (assuming categorical data).

# Encode GuideSeq column if needed (assuming it's categorical)
label_encoder = LabelEncoder()
train_df['GuideSeq_encoded'] = label_encoder.fit_transform(train_GuideSeq)

Data Splitting

Separate features and targets (X and y), then split into training and testing sets (80/20 ratio).

# Split the data into features (X) and target variable (y)
X = train_df[['GuideSeq_encoded']]  # Use appropriate features
y = train_df['Y']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Model Building, Prediction and Evaluation

# Models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'SVR': SVR(),
    'XGBoost': XGBRegressor()
}

for model_name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Make predictions
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'Model: {model_name}')
    print(f'Mean Squared Error: {mse}')
    print(f'R-squared: {r2}')

Save the Model to a Pickle File

import pickle

best_model = None
best_r2 = float('-inf')  # Initialize with negative infinity

# Check if the current model has a higher R-squared than the best so far
if r2 > best_r2:
    best_r2 = r2
    best_model = model


# Save the best model to a file using pickle
if best_model is not None:
    with open('regression.pkl', 'wb') as file:
        pickle.dump(best_model, file)
        print('Best model saved to best_model.pkl')
else:
    print('No best model found.')

Model Prediction

Test the best linear regression model using `X_train` and `y_train`. Predict the damage for a new gRNA sequence length and classify it as "Damaging" or "Not Damaging" based on a threshold of 0.5. Print the predicted damage value and the gRNA effect. Additionally, iterate through various models, train them, predict using `X_test`, and classify predictions based on the threshold, storing the results in `y_pred_classification`.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score

# Assuming X and y are already prepared and contain the appropriate data

# Train a linear regression model
# Train the model using X (features) and y (target)
regression_model = LinearRegression()
regression_model.fit(X_train, y_train) 

# Threshold for classification
classification_threshold = 0.5

# New sequence length for prediction
gRNA_sequence = np.array([len("ACTTAGACAAGACAAGTAGACA")]).reshape(-1, 1)

# Predict using the model
predicted_damage = regression_model.predict(gRNA_sequence)

# Classify the prediction based on the threshold
prediction_label = "Damaging" if predicted_damage >= classification_threshold else "Not Damaging"

print("Predicted Damage Value:", predicted_damage[0])
print("Effect of the gRNA:", prediction_label)

# Define the classification threshold
classification_threshold = 0.5

for model_name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Make predictions

# Classify predictions based on the threshold
y_pred_classification = np.where(y_pred >= classification_threshold, 1, 0)
0
Subscribe to my newsletter

Read articles from Daniel_Adediran directly inside your inbox. Subscribe to the newsletter, and don't miss out.

Written by

Daniel_Adediran
Daniel_Adediran