Predicting Health Insurance Charges Using Machine Learning

In Nigeria, where many individuals pay for healthcare out of pocket, understanding and predicting medical expenses is crucial for financial planning. With data science and machine learning, we can estimate healthcare costs based on demographic and lifestyle factors, helping individuals, policymakers, and private insurers make informed decisions. In this project, we employ a Linear Regression Model to estimate healthcare expenses using Python.
Who Benefits from This Prediction?
Predicting healthcare costs has advantages for multiple stakeholders:
- Individuals: Helps people anticipate medical expenses and plan their finances accordingly.
- Employers: Assists companies in structuring employee health benefits more effectively.
- Healthcare Providers: Enables hospitals and clinics to understand patient spending patterns and adjust services accordingly.
- Government and NGOs: Supports policy planning for universal healthcare and targeted subsidies.
Understanding the Dataset
We utilize a dataset, loaded below with pandas, that includes the following features:
import pandas as pd

# Load the insurance dataset
insurance_data_path = 'insurance.csv'
insurance = pd.read_csv(insurance_data_path)
print(insurance)
- Age: The age of the individual.
- Sex: The gender of the policyholder.
- BMI: Body Mass Index, an indicator of body fat.
- Children: Number of dependent children covered by the insurance.
- Smoker: Whether the individual smokes or not.
- Region: The geographic region where the policyholder resides.
- Charges: The total medical expenses billed to insurance (our target variable).
       age     sex     bmi  children smoker     region       charges
0     19.0  female  27.900       0.0    yes  southwest     16884.924
1     18.0    male  33.770       1.0     no  Southeast     1725.5523
2     28.0    male  33.000       3.0     no  southeast     $4449.462
3     33.0    male  22.705       0.0     no  northwest  $21984.47061
4     32.0    male  28.880       0.0     no  northwest    $3866.8552
...    ...     ...     ...       ...    ...        ...           ...
1333  50.0    male  30.970       3.0     no  Northwest   $10600.5483
1334 -18.0  female  31.920       0.0     no  Northeast     2205.9808
1335  18.0  female  36.850       0.0     no  southeast    $1629.8335
1336  21.0  female  25.800       0.0     no  southwest      2007.945
1337  61.0  female  29.070       0.0    yes  northwest    29141.3603

[1338 rows x 7 columns]
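Even from this preview, some data-quality problems stand out: dollar signs in the charges, inconsistent capitalization of region names, and a negative age. A few quick checks (an illustrative sketch, not part of the original walkthrough) make these issues explicit:
# Quick data-quality checks on the raw dataset
print(insurance.dtypes)                   # 'charges' is likely read as text ('object') because of the '$' signs
print(insurance['sex'].unique())          # inconsistent gender labels (e.g., 'M', 'man', 'F', 'woman')
print(insurance['region'].unique())       # mixed-case region names
print((insurance['age'] <= 0).sum())      # rows with non-positive ages
print((insurance['children'] < 0).sum())  # rows with negative children counts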
Data Cleaning and Preprocessing
Before modeling, it is essential to clean the dataset to ensure accurate predictions. The following steps were taken:
1. Standardizing Gender Labels: Different representations of gender (e.g., 'M', 'man', 'F', 'woman') were standardized to 'male' and 'female'.
2. Converting Charges to Numeric: Stripping the '$' signs and casting the 'charges' column to a numeric type.
3. Filtering Out Invalid Data: Removing entries with non-positive ages and setting negative values of children to zero.
4. Lowercasing Region Names: Standardizing the text format of the categorical variables.
These steps are wrapped in a single `clean_dataset` function:
def clean_dataset(insurance):
    # Standardize gender labels to 'male' and 'female'
    insurance['sex'] = insurance['sex'].replace({'M': 'male', 'man': 'male', 'F': 'female', 'woman': 'female'})
    # Strip the '$' signs and convert charges to a numeric type
    insurance['charges'] = insurance['charges'].replace({r'\$': ''}, regex=True).astype(float)
    # Keep only rows with a positive age
    insurance = insurance[insurance['age'] > 0]
    # Replace negative children counts with zero
    insurance.loc[insurance['children'] < 0, 'children'] = 0
    # Lowercase the region names
    insurance['region'] = insurance['region'].str.lower()
    return insurance.dropna()

cleaned_insurance = clean_dataset(insurance)
Building the Prediction Model
To predict insurance charges, we implement a Linear Regression Model with the following steps:
1. Feature Engineering:
- Categorical variables ('sex', 'smoker', 'region') are converted into numerical values using one-hot encoding.
- The numerical features ('age', 'bmi', 'children') are kept as they are.
# Separate the features (X) from the target variable (y)
X = cleaned_insurance.drop('charges', axis=1)
y = cleaned_insurance['charges']
categorical_features = ['sex', 'smoker', 'region']
numerical_features = ['age', 'bmi', 'children']
# Convert categorical variables to dummy variables
X_categorical = pd.get_dummies(X[categorical_features], drop_first=True)
# Combine numerical features with dummy variables
X_processed = pd.concat([X[numerical_features], X_categorical], axis=1)
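To see what the model will actually receive, we can inspect the processed feature matrix (a quick illustrative check; with drop_first=True, one category per variable is dropped):
# Inspect the columns and first rows of the processed features
print(X_processed.columns.tolist())
print(X_processed.head())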
2. Scaling the Data:
- To ensure uniformity across features, numerical data is standardized using `StandardScaler()`.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_processed)
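A brief check (illustrative) confirms what `StandardScaler()` does: each column of the scaled matrix has a mean of roughly 0 and a standard deviation of roughly 1.
# Each feature now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))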
3. Training the Model:
- A Pipeline is used to automate data scaling and model training.
- A 5-fold cross-validation is performed to assess model reliability.
- The model performance is evaluated using Mean Squared Error (MSE) and the R-Squared (R²) score.
- Mean Squared Error (MSE): the average squared difference between actual and predicted values. It indicates how well the model fits the data, with a lower MSE signifying higher accuracy. Because MSE is expressed in the squared units of the target variable, a large value points to sizeable prediction errors and a need for model improvement.
- R-Squared (R²) Score: the proportion of the variability in the target variable (insurance charges) that the model explains. A value of 1 means perfect predictions, a value near 0 means the model explains almost none of the variation, and it can even be negative for a model that fits worse than simply predicting the mean. A higher R² suggests better predictive accuracy, but an R² very close to 1 can indicate overfitting, meaning the model might not perform well on new, unseen data. The standard formulas for both metrics are shown below.
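For reference, the standard definitions of the two metrics, where $y_i$ are the actual charges, $\hat{y}_i$ the predicted charges, and $\bar{y}$ the mean of the actual charges, are:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$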
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

lin_reg = LinearRegression()

# Pipeline: the scaler and the regression model are applied in sequence
steps = [("scaler", scaler), ("lin_reg", lin_reg)]
insurance_model_pipeline = Pipeline(steps)

# Fitting the model (the pipeline scales the features internally, so we pass the unscaled features)
insurance_model_pipeline.fit(X_processed, y)

# Evaluating the model with 5-fold cross-validation
mse_scores = -cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(insurance_model_pipeline, X_processed, y, cv=5, scoring='r2')
mean_mse = np.mean(mse_scores)
mean_r2 = np.mean(r2_scores)
Model Evaluation
After training, we print the cross-validated scores:
print("Mean MSE:", mean_mse)
print("Mean R2:", mean_r2)
Making Predictions on New Data
To validate our model, we apply it to a separate validation dataset. The predictions are generated and then clipped to a minimum charge of $1,000 to prevent unrealistic (negative) values.
validation_data = pd.read_csv('validation_dataset.csv')
# One-hot encode the categorical variables as in training, then align the columns with the training feature matrix
validation_data_processed = pd.get_dummies(validation_data, columns=['sex', 'smoker', 'region'], drop_first=True)
validation_data_processed = validation_data_processed.reindex(columns=X_processed.columns, fill_value=0)
validation_predictions = insurance_model_pipeline.predict(validation_data_processed)
validation_data['predicted_charges'] = validation_predictions
# Enforce a minimum predicted charge of 1000
validation_data.loc[validation_data['predicted_charges'] < 1000, 'predicted_charges'] = 1000
print(validation_data.head())
Conclusion
By leveraging machine learning, we gain valuable insights into healthcare expenses. Insurers can use this model to estimate potential costs, adjust premiums, and offer personalized plans. With further refinement, incorporating additional factors like medical history and lifestyle habits, we can enhance accuracy and fairness in insurance pricing.