Prediction of Cerebrovascular Accident (Stroke) in patients

Akumute Favour
6 min read

Problem

Predicting stroke is challenging when the data doesn’t reflect reality evenly. In many medical datasets, one class—like patients with stroke—can dominate, making it harder for models to learn meaningful patterns for the minority class. This imbalance can lead to biased predictions, poor sensitivity, and real-world risks in clinical decision-making.

How do I train a model that can make predictions regardless of the imbalanced nature of the dataset?

Methodology

Data collection:

The stroke dataset, which includes the symptoms exhibited by each patient, the drugs used to treat them, and each patient's diagnosis (stroke or no stroke), was acquired from the International Stroke Trial (IST) database along with its corresponding data dictionary.

Data Preprocessing

The snippet of the dataset above shows the symptoms exhibited by patients, their habits and the drugs used to treat them. However, this study focuses on the patients' symptoms and diagnoses, so the drug details are not needed.

So what did I do?

I selected the features needed from the dataset.

# keep only the symptom and diagnosis columns; .copy() avoids chained-assignment
# warnings when the selection is modified later
ex_data = data[["SEX", "AGE", "RSBP", "RVISINF", "RDEF1", "RDEF2", "RDEF3", "RDEF4",
                "RDEF5", "RDEF6", "RDEF7", "STYPE", "DDIAGISC",
                "DDIAGHA", "DDIAGUN", "DNOSTRK"]].copy()

The original dataset had 112 features; after feature selection, 16 were retained for the project. These features were then renamed according to the information in the data dictionary.

ex_data.rename(columns={"RSBP": "BP",
                        "RVISINF": "INFARCTION",
                        "RDEF1": "FACE_DEFICIT",
                        "RDEF2": "ARM_DEFICIT",
                        "RDEF3": "LEG_DEFICIT",
                        "RDEF4": "DYSPHASIA",
                        "RDEF5": "HEMIANOPIA",
                        "RDEF6": "VS_DISORDER",
                        "RDEF7": "CEREBELLAR_SIGNS",
                        "STYPE": "STROKE_TYPE",
                        "DDIAGISC": "ISCHEMIC_DIAG",
                        "DDIAGHA": "HAEMORRHAGIC_DIAG",
                        "DDIAGUN": "INDETERMINATE_DIAG",
                        "DNOSTRK": "STROKE"}, inplace=True)
ex_data.head()

Missing data was discovered in the diagnosis column and removed, after which label encoding was carried out. Most of the variables in the dataset had three values: 'Y' for yes, 'N' for no, and 'C' for can't say. These were encoded as 2, 1, and 0, respectively, while a few columns had other values such as 'U' for undetermined and 'u' for unknown.
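A minimal sketch of that encoding step, assuming the renamed column names and a couple of made-up rows in place of the real data:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's 'Y'/'N'/'C' coding
df = pd.DataFrame({"INFARCTION": ["Y", "N", "C", "Y"],
                   "DYSPHASIA":  ["N", "C", "Y", "N"]})

# Encode 'Y' -> 2, 'N' -> 1, 'C' (can't say) -> 0, as described above
mapping = {"Y": 2, "N": 1, "C": 0}
for col in ["INFARCTION", "DYSPHASIA"]:
    df[col] = df[col].map(mapping)

print(df["INFARCTION"].tolist())  # [2, 1, 0, 2]
```

Columns with extra codes like 'U' or 'u' would simply get extended mappings of the same shape.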

Data Exploration

It was discovered during exploration that the dataset was also heavily imbalanced. Considering the fact that it is a stroke dataset, most of the patients were diagnosed with stroke as seen in the chart below, where 1 represents patients with stroke and 0 represents patients without stroke.

A chi-square test of independence was used to analyze relationships among the categorical variables. The test examines two hypotheses:

  • H0: There is no association between the categorical variables

  • H1: There is an association between the categorical variables

Using a significance level of 0.05, H0 is rejected when the p-value falls below 0.05 (equivalently, when the test statistic exceeds the critical value). The test results were visualized with a heatmap, where darker shades indicate p-values above 0.05, signifying no association between those variables.

Notably, the diagram suggests significant relationships among symptoms like face deficit, leg deficit, cerebellar signs, and visual disorder, indicating their likelihood to co-occur or be interrelated.
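As a sketch of how one such pairwise test can be run, `scipy.stats.chi2_contingency` does the work; the toy data below is hypothetical, standing in for two of the encoded symptom columns:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical encoded values (2 = yes, 1 = no) for two symptom columns
df = pd.DataFrame({"FACE_DEFICIT": [2, 2, 1, 1, 2, 1, 2, 1],
                   "LEG_DEFICIT":  [2, 2, 1, 1, 2, 1, 1, 2]})

# Chi-square test of independence on the contingency table of the two variables
contingency = pd.crosstab(df["FACE_DEFICIT"], df["LEG_DEFICIT"])
stat, p_value, dof, expected = chi2_contingency(contingency)

# Reject H0 (no association) when the p-value falls below 0.05
print(f"p-value = {p_value:.3f}, reject H0: {p_value < 0.05}")
```

Looping this over every pair of columns and collecting the p-values into a matrix is what produces the heatmap described above.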

Model Building

To improve performance on the imbalanced dataset, I applied two key techniques: cross-validation and data balancing. Cross-validation, specifically Stratified K-Fold, was used to ensure each fold preserved the original class distribution, providing a more reliable evaluation. For data balancing, I used SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples of the minority class, helping the model learn from both classes more effectively—crucial for imbalanced classification tasks like stroke prediction.

Results before data balancing

Before applying SMOTE, three algorithms were trained on the imbalanced dataset: random forest, logistic regression and support vector machine.

Here are the results:

Random Forest

Support Vector Machine

Logistic Regression

The three confusion matrices above show that all the models performed very well at predicting patients with stroke and very poorly at predicting patients without stroke.

I know, I know, "Isn't that what it's supposed to do?"

Well, the answer to that question is yes and no. Let me explain. As previously mentioned, the original data comes from a stroke trial, so there are far more cases of stroke than of people without stroke. So much so that it biases the algorithm, making it much easier to recognize people with stroke than people without. The problem is that misclassifying a patient in this setting is very dangerous. It's the modelling equivalent of a doctor misdiagnosing a patient, and we definitely don't want that.

Despite all this, the models have accuracy scores of 97%, 84% and 98% respectively. However, the AUC scores shed better light on how well each model classified each class: 0.5, 0.55 and 0.5 respectively, indicating performance no better than a coin flip.
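A tiny sketch, with made-up labels mirroring the imbalance, shows how a degenerate model that always predicts "stroke" can post high accuracy yet an AUC of exactly 0.5:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 97 stroke (1) vs 3 non-stroke (0), mimicking the imbalance
y_true = np.array([1] * 97 + [0] * 3)

# A degenerate "model" that always predicts stroke with the same score
y_pred = np.ones(100)
y_score = np.full(100, 0.9)

print(accuracy_score(y_true, y_pred))   # 0.97 -- looks great
print(roc_auc_score(y_true, y_score))   # 0.5  -- no discriminative power
```

This is why accuracy alone is misleading on imbalanced data, and why the AUC scores matter here.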

Data balancing

## data balancing using both skf and SMOTE
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold

stratified_kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in stratified_kf.split(X, y):
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]

    # oversample only the training fold; the validation fold stays untouched
    smote = SMOTE(random_state=42)
    X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

This code iterates over the train and validation indices generated by the stratified k-fold splitter, ensuring each fold maintains the original class distribution. For each fold, it separates the features (X) and target (y) into training and validation sets, then applies SMOTE on the training data to create synthetic samples of the minority class, balancing the dataset. The balanced training set (X_train_balanced and y_train_balanced) was then used to train models that are less biased by the imbalanced class distribution, while the validation set remained untouched for unbiased performance evaluation.

The next step was to initialize lists to store evaluation metrics for each model, train each model on the balanced training data, and make predictions on the validation set, obtaining both the predicted classes and their probabilities. The results of each iteration, that is, the AUC score, mean squared error (MSE), accuracy, confusion matrix and a detailed classification report, were stored in these lists. After collecting these metrics, average values across iterations were calculated.

Results after data balancing

Random Forest

Support Vector Machine

Logistic Regression

That definitely looks better!

Now all three models perform well on both the stroke and non-stroke classes, reducing the risk of misdiagnosis. Although the accuracy scores dropped (a normal side effect of data balancing), they remained strong at 89%, 86% and 86% respectively. The AUC scores, on the other hand, improved to 0.95, 0.93 and 0.94, indicating a high chance of correctly classifying each class.

Model Comparison

Out of the three models, the random forest classifier had the highest accuracy score, the highest AUC score and, as the diagram below shows, the lowest mean squared error, making it the best-performing model and the one chosen for deployment.

Conclusion

By integrating robust techniques such as stratified k-fold cross-validation and SMOTE, this project effectively addressed data imbalance and enhanced model reliability, as evidenced by improved AUC, accuracy and detailed classification metrics. This approach not only underscores the importance of proper data preprocessing and model evaluation, but also shows how overlooking the right metrics can, in a clinical setting, cost lives.

Thank you for reading.

I hope you learnt something.
