Symptom Disease Diagnosis with Dynamic Questioning

Table of contents
- Introduction
- Inception of the Idea
- Core Objective
- Facing Challenges: The Dataset Dilemma
- Data Preprocessing, EDA, and Feature Engineering
- Model Training and Evaluation
- Bayesian Smoothing: The Confidence Balancer
- The Dynamic Questioning Mechanism
- GitHub Workflow: Transparent Development
- The Road Ahead: iOS App Integration
- Conclusion

Introduction
In the age of data-driven healthcare, technology isn’t just enhancing diagnostics — it’s transforming how we approach them altogether. What started as a simple idea for a student project eventually evolved into something much more meaningful: a machine learning-powered disease diagnosis model that adapts dynamically to user input. This blog captures our journey — the sparks of inspiration, the roadblocks we hit, the breakthroughs we celebrated, and the next step we now chase.
Inception of the Idea
At the start, we began with a familiar question: what can we build that's both technically impressive and genuinely useful? After tossing around ideas and diving into our usual brainstorming chaos, inspiration came unexpectedly, from an episode of House MD, a TV series centred on diagnosing mysterious diseases. Diagnosis felt like a great way to showcase how patient data can be used to identify diseases with a machine learning model.
That thought sparked the concept of a dynamic, intelligent disease diagnosis system.
Core Objective
Our idea was to build a system that would:
- Accept initial symptoms from the user
- Dynamically decide the next best symptom to ask about, based on prediction confidence
- Predict the disease with growing confidence
- Know when to stop asking questions, based on a confidence cutoff
The core of this project is built on entropy, support, confidence intervals, and decision-tree concepts, implemented using ensemble models such as Random Forest and XGBoost.
This wasn’t just a classification problem — it was a conversation with data, and we wanted it to feel just as natural.
Our idea was also to show different approaches to building the model: Random Forest is based on bagging (majority vote), while XGBoost is based on boosting, and each has its own strengths and drawbacks.
To make our efforts visible, we built a GitHub repository so we can show how the data, and the approach taken to cleaning it, affect the model.
ChatGPT and DeepSeek helped us a lot with the coding, but the core ideas and concepts that I (Raman) and Pratham implemented were learnt organically, with a lot of time invested, from building a basic understanding through videos by Krish Naik and CampusX to solving and deducing questions using ChatGPT.
Facing Challenges: The Dataset Dilemma
Our journey hit a major bump early on. We began with a large dataset containing over 200,000 entries. But as soon as we got to hyperparameter tuning, we realised our machines couldn't handle it efficiently within the time we had.
I spent about a week trying to fix it, asking different people for suggestions; many said we could use the high-end computers provided by the college. But the idea here was to learn, so I left that work on a GitHub branch before checkpoint 1.
After multiple failed training sessions (and several system crashes), we made the tough call to switch to a smaller, more manageable dataset. It wasn’t easy letting go of all that data, but this pivot gave us room to iterate, experiment, and actually finish the project.
Datasets:
Initial dataset : https://www.kaggle.com/datasets/dhivyeshrk/diseases-and-symptoms-dataset?resource=download
Final selected Dataset : https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset
Data Preprocessing, EDA, and Feature Engineering
The data we initially worked with was in text format; we converted it to binary values using one-hot encoding to indicate symptoms, paired with disease labels. In another experimental approach, we kept the text-based symptom entries and applied TF-IDF transformations. Our main preprocessing steps included:
Handling missing or inconsistent values
Encoding categorical symptom names (for text-based approach)
Creating a Binary Symptom Matrix for structured input
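As a concrete sketch, the binary symptom matrix step might look like the following (the helper and symptom names are illustrative, not the project's actual code):

```python
# Sketch of the binary symptom matrix step, assuming each record is a
# variable-length list of symptom strings. Names here are illustrative.
def build_symptom_matrix(records, vocabulary=None):
    """One-hot encode variable-length symptom lists into fixed-width rows."""
    if vocabulary is None:
        vocabulary = sorted({s for rec in records for s in rec})
    index = {sym: i for i, sym in enumerate(vocabulary)}
    matrix = []
    for rec in records:
        row = [0] * len(vocabulary)
        for sym in rec:
            if sym in index:          # unseen symptoms are simply ignored
                row[index[sym]] = 1
        matrix.append(row)
    return matrix, vocabulary

records = [["fever", "cough"], ["headache"], ["fever", "headache"]]
X, vocab = build_symptom_matrix(records)
```

Fixing the vocabulary once and reusing it for new inputs is what keeps the rows fixed-length, which matters later when feeding the matrix to the models.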
Although each disease had only 120 balanced examples, the use of ensemble models combined with confidence calibration and careful feature pruning ensured the model was robust to overlap. Additionally, we verified through learning curves that increasing data beyond 120 examples would marginally improve performance, but was not critical for acceptable confidence levels.
EDA
Co-occurrence Matrix
By building the co-occurrence matrix we identified communities of diseases that mostly occur together, which helps in building the dynamic questioning logic later.
Each decision here played a crucial role in shaping model performance later.
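The co-occurrence matrix itself is straightforward to compute from the binary symptom matrix; a minimal sketch with toy data (not the real dataset):

```python
import numpy as np

# With a binary symptom matrix X (rows = patient records, columns =
# symptoms), X.T @ X counts how often each pair of symptoms appears
# together; the diagonal holds per-symptom frequencies.
X = np.array([
    [1, 1, 0],   # fever + cough
    [0, 0, 1],   # headache only
    [1, 0, 1],   # fever + headache
])
cooccurrence = X.T @ X
# Zero the diagonal so community detection looks only at pairs.
np.fill_diagonal(cooccurrence, 0)
```

The same trick works for disease co-occurrence by using the label matrix instead of the symptom matrix.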
Model Training and Evaluation
We didn’t want to rely on just one model, so we experimented with two solid classifiers:
1. Random Forest Classifier
An ensemble learning model that builds multiple decision trees and aggregates their outputs.
We tuned:
n_estimators = 150
max_depth to avoid overfitting
class_weight = 'balanced' to address class imbalance
random_state for reproducibility
This model gave reliable results, especially in terms of interpretability and speed.
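A minimal sketch of training with these parameters follows; the data is a synthetic stand-in, and `max_depth=12` is an illustrative cap since the exact tuned value is not stated here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; the real project uses the binary symptom matrix.
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Parameters mirror the post: 150 trees, a depth cap, balanced classes.
model = RandomForestClassifier(
    n_estimators=150,
    max_depth=12,            # illustrative cap against overfitting
    class_weight="balanced",
    random_state=42,         # reproducibility
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```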
After running the baseline model, we identified the symptoms most common in the dataset.
Training the model
Hyperparameter Tuning
Extraction of Top Symptoms after Baseline Model
Accuracy Improvement after Hyperparameter Tuning
Reliability curve
2. XGBoost Classifier
A more sophisticated boosting model that often excels with skewed data.
Key parameters included:
learning_rate
max_depth
n_estimators
objective = 'multi:softprob'
XGBoost handled minority classes better and delivered slightly higher accuracy, though it required more tuning and care.
In this project, we built a smart disease diagnosis assistant using symptoms as input. The goal was to predict the most likely disease based on a patient's symptoms and improve accuracy through XGBoost, Optuna-based hyperparameter tuning, and a dynamic questioning mechanism.
Why XGBoost?
XGBoost is not an algorithm but a powerful library based on Gradient Boosting. We chose it for its:
- High speed and performance on large, complex data
- Parallel processing and column-wise optimization
- Cache-aware and out-of-core computing for memory efficiency
- Multi-language and library integration
- Support for classification, regression, and ranking problems
Unlike traditional gradient boosting, XGBoost makes training faster using histogram-based methods (tree_method='hist') and even supports GPU acceleration with gpu_hist.
Sample Tree:
Confusion Matrix
Tuning with Optuna
We initially observed no accuracy improvement after tuning. Upon investigation, we realised that improper noise handling and poor parameter-space choices limited model learning. Once fixed, we used Optuna, a cutting-edge hyperparameter optimization framework.
Why Optuna?
- Faster than GridSearchCV or RandomizedSearchCV
- Uses Bayesian optimization to search the best hyperparameters intelligently
- Helped boost test-set accuracy significantly after tuning
Label Encoding vs One-Hot Encoding
We used LabelEncoder instead of one-hot encoding for the target variable (disease) because:
- XGBoost accepts integer-encoded labels for classification tasks
- It's more memory-efficient and compatible with our multiclass classification setup
One-hot encoding would have increased the feature space unnecessarily; it is better suited to input features, not labels, in this case.
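For example (the disease names here are placeholders, not necessarily those in the dataset):

```python
from sklearn.preprocessing import LabelEncoder

diseases = ["Malaria", "Dengue", "Typhoid", "Malaria", "Dengue"]
le = LabelEncoder()
y = le.fit_transform(diseases)     # compact integer codes, one per class
decoded = le.inverse_transform(y)  # back to disease names for display
```

Keeping the fitted encoder around is important: the same `inverse_transform` is what turns the model's integer prediction back into a readable disease name.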
Challenges Faced
1. Lack of Realistic Noisy Data: we had to simulate noise manually by adding synthetic irrelevant symptoms to mimic real-world scenarios.
2. Binary Matrix Creation: mapping variable-length symptom inputs to fixed-length vectors was crucial for compatibility with XGBoost. This matrix helped track symptom presence.
3. Hyperparameter Tuning Trap: initially, tuning didn't help because of a test/train mismatch (e.g., noise differences). We corrected this by ensuring consistent data handling.
4. Learning Optuna: we explored Optuna for the first time, and its performance surprised us. It required minimal setup but yielded significant gains.
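The manual noise step from challenge 1 could be sketched like this (a hypothetical helper, not the project's actual implementation):

```python
import random

# Hypothetical sketch of the manual noise step: switch on a few random
# "irrelevant" symptom bits in a one-hot row to mimic messy real input.
def add_symptom_noise(row, n_noise=2, seed=None):
    rng = random.Random(seed)
    off_bits = [i for i, v in enumerate(row) if v == 0]
    noisy = list(row)
    for i in rng.sample(off_bits, min(n_noise, len(off_bits))):
        noisy[i] = 1
    return noisy

clean = [1, 0, 0, 1, 0, 0]
noisy = add_symptom_noise(clean, n_noise=2, seed=7)
```

Applying the same noise procedure to both the train and test splits is exactly the "consistent data handling" fix mentioned in challenge 3.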
Final Thoughts
This project helped us:
- Understand how to process real-world medical data
- Apply XGBoost effectively
- Learn powerful tuning with Optuna
- Build an intelligent, interactive ML system
Bayesian Smoothing: The Confidence Balancer
One issue we faced was overconfident predictions when only a few symptoms were provided. That’s where Bayesian Smoothing came in:
It added pseudo-counts to reduce prediction variance
Smoothed out probability estimates for rare diseases
Helped the model remain cautious in early stages of prediction
This small addition made a noticeable difference in how stable our predictions felt.
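One simple way to implement pseudo-count smoothing is to blend the raw probabilities with a uniform prior; the `alpha` strength here is an assumption, and the project's exact scheme may differ:

```python
import numpy as np

# Pseudo-count smoothing over the model's class probabilities.
def smooth_probabilities(proba, alpha=1.0):
    """Blend raw probabilities with a uniform prior via pseudo-counts."""
    proba = np.asarray(proba, dtype=float)
    n_classes = proba.shape[-1]
    # Each class gets alpha/n_classes of pseudo-mass; renormalise.
    return (proba + alpha / n_classes) / (1.0 + alpha)

raw = np.array([0.97, 0.02, 0.01, 0.00])   # overconfident early prediction
calm = smooth_probabilities(raw, alpha=1.0)
```

The smoothed vector still sums to one and keeps the same top class, but the top probability is pulled toward uniform, so the model stays cautious while few symptoms are known.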
sample output:
The Dynamic Questioning Mechanism
Here’s where our model stood out.
We wanted our system to interact like a doctor would — ask one question, evaluate, then decide the next. Here’s how we made that happen:
Start with user-provided symptoms
Predict disease probabilities
If the top prediction’s confidence is below 75%, select the next most relevant symptom
Repeat this loop until the confidence exceeds the threshold or options run out
To choose the next symptom smartly, we used feature importance (from Random Forest) and information gain.
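The loop above can be sketched as follows; the stub model and question ranking are stand-ins so the control flow is self-contained, not the project's actual classifier:

```python
import numpy as np

# Stub standing in for a fitted classifier with predict_proba: its
# confidence simply grows with the number of confirmed symptoms.
class StubModel:
    def predict_proba(self, X):
        k = X.sum()
        top = min(0.4 + 0.2 * k, 0.99)
        rest = (1 - top) / 2
        return np.array([[top, rest, rest]])

def diagnose(model, n_symptoms, ranked_questions, threshold=0.75):
    answers = np.zeros((1, n_symptoms))
    proba = model.predict_proba(answers)[0]
    for q in ranked_questions:         # ranked by feature importance
        if proba.max() >= threshold:
            break                      # confident enough: stop asking
        answers[0, q] = 1              # simulate a "yes" answer
        proba = model.predict_proba(answers)[0]
    return int(proba.argmax()), float(proba.max())

disease, confidence = diagnose(StubModel(), n_symptoms=5,
                               ranked_questions=[0, 1, 2, 3])
```

In the real system the "yes" answer comes from the user, and the question ranking comes from the Random Forest's feature importances rather than a fixed list.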
The code for dynamic questioning is available on GitHub:
Sample Output of Code:
GitHub Workflow: Transparent Development
To make our effort visible and traceable (especially for evaluation), we structured our GitHub repo with branches for each major experiment:
The branches:
checkpoint_1
checkpoint_2(with_noise)
checkpoint_2(without-noise)
contain the Random Forest classifier model, and the XGBOOST branch contains the model built with XGBoost.
dynamic_questing_with_noise and main contain the end point of this project: dynamic questioning using the Random Forest classifier. The XGBOOST branch also has the dynamic questioning logic implemented.
Each branch came with its own logs, graphs, and comparison reports. This not only impressed our professors but also helped us stay organized and reflective throughout the project.
GITHUB LINK: https://github.com/raman7976/Symptom-Disease-Diagnosis-ML
The Road Ahead: iOS App Integration
Our dream doesn’t stop at a Python notebook.
Next, we plan to build an iOS app using Swift and SwiftUI that will:
Let users enter their symptoms via a clean UI
Dynamically ask follow-up questions based on model logic
Display predictions with explanation
Store historical results securely (with consent)
We’re aiming to integrate the model via CoreML or serve it using an API backend.
Conclusion
This project wasn’t just about building a machine learning model. It was about solving a real problem with limited resources, making tough choices, and learning how to think like engineers and communicate like researchers.
We grew as developers, as data scientists, and as a team. We learned when to pivot, when to push, and when to simplify. From brainstorming around a Netflix show to planning a full-fledged iOS app — this journey was full of lessons and surprises.
And we’re just getting started.
Stay tuned for the app release — we can’t wait to show you what’s next.
Written by RAMAN KUMAR SINGH