Predicting Health Insurance Charges Using Machine Learning

In Nigeria, where many individuals pay for healthcare out of pocket, understanding and predicting medical expenses is crucial for financial planning. With data science and machine learning, we can estimate healthcare costs based on demographic and lifestyle factors, helping individuals, policymakers, and private insurers make informed decisions. In this project, we employ a Linear Regression Model to estimate healthcare expenses using Python.
Who Benefits from This Prediction?
Predicting healthcare costs has advantages for multiple stakeholders:
- Individuals: Helps people anticipate medical expenses and plan their finances accordingly.
- Employers: Assists companies in structuring employee health benefits more effectively.
- Healthcare Providers: Enables hospitals and clinics to understand patient spending patterns and adjust services accordingly.
- Government and NGOs: Supports policy planning for universal healthcare and targeted subsidies.
Understanding the Dataset
We utilize a dataset, loaded below with pandas, that includes the following features:
import pandas as pd

# Load the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
print(insurance)
- Age: The age of the individual.
- Sex: The gender of the policyholder.
- BMI: Body Mass Index, an indicator of body fat.
- Children: Number of dependent children covered by the insurance.
- Smoker: Whether the individual smokes or not.
- Region: The geographic region where the policyholder resides.
- Charges: The total medical expenses billed to insurance (our target variable).
       age     sex     bmi  children smoker     region       charges
0     19.0  female  27.900       0.0    yes  southwest     16884.924
1     18.0    male  33.770       1.0     no  Southeast     1725.5523
2     28.0    male  33.000       3.0     no  southeast     $4449.462
3     33.0    male  22.705       0.0     no  northwest  $21984.47061
4     32.0    male  28.880       0.0     no  northwest    $3866.8552
...    ...     ...     ...       ...    ...        ...           ...
1333  50.0    male  30.970       3.0     no  Northwest   $10600.5483
1334 -18.0  female  31.920       0.0     no  Northeast     2205.9808
1335  18.0  female  36.850       0.0     no  southeast    $1629.8335
1336  21.0  female  25.800       0.0     no  southwest      2007.945
1337  61.0  female  29.070       0.0    yes  northwest    29141.3603

[1338 rows x 7 columns]
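Even from this preview, some data-quality problems stand out: dollar signs in the charges, inconsistent capitalization of region names, and a negative age. A few quick checks (an illustrative sketch, not part of the original walkthrough) make these issues explicit:
# Quick data-quality checks on the raw dataset
print(insurance.dtypes)                   # 'charges' is likely read as text ('object') because of the '$' signs
print(insurance['sex'].unique())          # inconsistent gender labels (e.g., 'M', 'man', 'F', 'woman')
print(insurance['region'].unique())       # mixed-case region names
print((insurance['age'] <= 0).sum())      # rows with non-positive ages
print((insurance['children'] < 0).sum())  # rows with negative children counts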
Data Cleaning and Preprocessing
Before modeling, it is essential to clean the dataset to ensure accurate predictions. The following steps were taken:
1. Standardizing Gender Labels: Different representations of gender (e.g., 'M', 'man', 'F', 'woman') were standardized to 'male' and 'female'.
2. Converting Charges to Numeric: Stripping the '$' signs and casting the 'charges' column to a numeric type.
3. Filtering Out Invalid Data: Removing entries with non-positive ages and setting negative values of children to zero.
4. Lowercasing Region Names: Standardizing the text format of the categorical variables.
These steps are wrapped in a single `clean_dataset` function:
def clean_dataset(insurance):
    # Standardize gender labels to 'male' and 'female'
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    # Strip the '$' signs and convert charges to a numeric type
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    # Keep only rows with a positive age
    insurance = insurance[insurance['age'] > 0]
    # Replace negative children counts with zero
    insurance.loc[insurance['children'] < 0, 'children'] = 0
    # Lowercase the region names
    insurance['region'] = insurance['region'].str.lower()
    return insurance.dropna()

cleaned_insurance = clean_dataset(insurance)
Building the Prediction Model
To predict insurance charges, we implement a Linear Regression Model with the following steps:
1. Feature Engineering:
- Categorical variables ('sex', 'smoker', 'region') are converted into numerical values using one-hot encoding.
- The numerical features ('age', 'bmi', 'children') are kept as they are.
# Separate the features (X) from the target variable (y)
X = cleaned_insurance.drop('charges', axis=1)
y = cleaned_insurance['charges']
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Convert categorical variables to dummy variables
X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
# Combine numerical features with dummy variables
X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
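To see what the model will actually receive, we can inspect the processed feature matrix (a quick illustrative check; with drop_first=True, one category per variable is dropped):
# Inspect the columns and first rows of the processed features
print(X_processed.columns.tolist())
print(X_processed.head())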
2. Scaling the Data:
- To ensure uniformity across features, numerical data is standardized using `StandardScaler()`.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_processed)
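A brief check (illustrative) confirms what `StandardScaler()` does: each column of the scaled matrix has a mean of roughly 0 and a standard deviation of roughly 1.
# Each feature now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))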
3. Training the Model:
- A Pipeline is used to automate data scaling and model training.
- A 5-fold cross-validation is performed to assess model reliability.
- The model performance is evaluated using Mean Squared Error (MSE) and the R-Squared (R²) score.
- Mean Squared Error (MSE): the average squared difference between actual and predicted values. It indicates how well the model fits the data, with a lower MSE signifying higher accuracy. Because MSE is expressed in the squared units of the target variable, a large value points to sizeable prediction errors and a need for model improvement.
- R-Squared (R²) Score: the proportion of the variability in the target variable (insurance charges) that the model explains. A value of 1 means perfect predictions, a value near 0 means the model explains almost none of the variation, and it can even be negative for a model that fits worse than simply predicting the mean. A higher R² suggests better predictive accuracy, but an R² very close to 1 can indicate overfitting, meaning the model might not perform well on new, unseen data. The standard formulas for both metrics are shown below.
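For reference, the standard definitions of the two metrics, where $y_i$ are the actual charges, $\hat{y}_i$ the predicted charges, and $\bar{y}$ the mean of the actual charges, are:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$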
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

lin_reg = LinearRegression()

# Pipeline: the scaler and the regression model are applied in sequence
steps = [("scaler", scaler), ("lin_reg", lin_reg)]
insurance_model_pipeline = Pipeline(steps)

# Fitting the model (the pipeline scales the features internally, so we pass the unscaled features)
insurance_model_pipeline.fit(X_processed, y)

# Evaluating the model with 5-fold cross-validation
mse_scores = -cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)
Model Evaluation
After training, we print the cross-validated scores:
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)
Making Predictions on New Data
To validate our model, we apply it to a separate validation dataset. The predictions are generated and then clipped to a minimum charge of $1,000 to prevent unrealistic (negative) values.
validation_data = pd.read_csv('validation_dataset.csv')
# One-hot encode the categorical variables as in training, then align the columns with the training feature matrix
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)
validation_data_processed = validation_data_processed.reindex(columns=X_processed.columns, fill_value=0)
validation_predictions = insurance_model_pipeline.predict(validation_data_processed)
validation_data['predicted_charges'] = validation_predictions
# Enforce a minimum predicted charge of 1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
print(validation_data.head())
Conclusion
By leveraging machine learning, we gain valuable insights into healthcare expenses. Insurers can use this model to estimate potential costs, adjust premiums, and offer personalized plans. With further refinement, incorporating additional factors like medical history and lifestyle habits, we can enhance accuracy and fairness in insurance pricing.