Predicting Employee Attrition - Supervised Classification Machine Learning

Vijaykrishna
38 min read

Introduction

Employee attrition is one of the major problems that every organization faces, regardless of the sector. It refers to the lifecycle of an organization’s workforce and is an inevitable part of any business: the process by which employees leave the workforce for personal or professional reasons. It becomes a cause of concern, however, when attrition crosses a particular threshold. For example, attrition among female employee groups could be a significant issue that needs special attention. The reasons for employee attrition are many and varied:

  • Work environment: stress due to overload, workplace accidents, conflict, lack of role clarity, dissatisfaction with supervisors, and poor work-group cohesion or coworker dissatisfaction.

  • Job-related: job dissatisfaction, unmet job expectations, low job involvement, monotonous work, limited promotional chances, poor instrumental communication, and readily available alternative job opportunities.

  • Compensation: lack of compensation and dissatisfaction with pay and distributive justice.

  • Personal and demographic: marital status, kinship responsibilities, children, age, tenure, relocation (for example, because of a spouse), pursuing higher studies, or pursuing a better opportunity.

  • Organizational: lack of professional growth and the organization not delivering on promises made during recruitment.

Organizations must frame strategies and ideas to control the growing employee attrition rate. The primary responsibility of HR analytics is to help organizations understand what is most important to their employees. The goal is to improve understanding and design of interventions, thereby increasing employee engagement and productivity and reducing unwanted attrition risk. Predictive analytics capabilities enable the design of an employee retention model to keep these valuable employees engaged on board.

Organizations are better off when they can retain good employees and the organizational knowledge they possess. Due to attrition, organizations suffer productivity losses, missed revenue opportunities, and lost profits, particularly when senior management changes. It becomes a significant cause of concern if employees leave the workforce faster than they are hired; the problems include the financial cost of recruiting, hiring, onboarding, and training new employees. For these reasons, companies should use various metrics to measure the reasons for attrition. However, it should be clear that zero attrition (no one leaves the organization) is neither possible nor desirable for the organization.

It is critical to understand the reasons that drive unwanted attrition in organizations. The most important reason can be dissatisfaction related to pay. In such a scenario, organizations should benchmark competitors’ salary levels, pay increases, and retention schemes to check whether their attrition rates align with other local companies in the industry. However, this is not enough; the organization must also examine other reasons. Some other observations shared by HR personnel show that attrition among female employees is generally higher and is lower among new recruits from top educational institutions. A systematic approach is to analyze the data after developing a comprehensive list of attrition drivers through interviews with managers, employees, and HR staff. Deeper insight can also be gained by evaluating the results of exit interviews.

It is possible to dig deeper with practical tools to determine the compelling reasons for attrition and, accordingly, to design strategies. The parameters that can be measured for this purpose include the involuntary attrition rate, new hire attrition contribution, retention rate, attrition breakdown by performance rating, attrition reason breakdown, employee engagement by employee commitment index, employee engagement index, employee retention index, market opportunity index, offer fit index, voluntary attrition rate, cost of employee attrition, average attrition value, average voluntary attrition value, attrition value per FTE, attrition cost rate, etc. Strategies related to recruiting profiles, manager and employee training programs, career planning and development plans, and pay policies can then be designed to reduce attrition.

Supervised Classification Machine Learning

Classification is a machine learning technique for predicting a dependent categorical variable. A classification problem occurs when the dependent variable has two or more categories and is predicted by a set of independent variables; many real-world problems are of this kind. Any number of dependent-variable categories can exist in classification. Standard logistic regression, however, applies only to two categories (binary classification). Logistic regression predicts binary outcomes, such as whether a customer will buy or not, Pass/Fail, Yes/No, 0/1; it models the probability of event = success versus event = failure. The steps for prediction using classification techniques include the following:

  1. Data exploration: This step requires preparing data to develop an efficient model. It includes appropriately understanding data, handling missing values, feature engineering, etc.

  2. Model development: Predictive analytics is extracting information from existing datasets to determine patterns and predict future outcomes and trends. We divide the dataset into training and test datasets to assess the patterns and predict outcomes. It is good practice to split the data using a fixed seed value: the seed makes the random number generator reproducible, so every execution of the program produces the same split, with the same observations in the training and test datasets (a minimal sketch of such a seeded split appears after this list of steps). The model is then developed on the training dataset using the required independent variables. Many supervised machine learning classification techniques are available in KNIME based on different algorithms, such as Naïve Bayes, k-nearest neighbors (k-NN), support vector machines (SVM), decision trees, random forests, bagging, and gradient boosting.

  3. Feature selection: Feature selection is critical in building predictive machine learning models. It involves selecting the most relevant features (variables, predictors) from your data that contribute significantly to the target variable's prediction. Here’s a breakdown of the process and its importance:

    Why Feature Selection?

    • Improves Model Performance: Reducing the number of irrelevant or redundant features can enhance model accuracy and reduce overfitting.

    • Reduces Complexity: Simplifies the model, making it more interpretable and faster to train.

    • Enhances Generalization: Focuses on the most critical features, leading to better generalization to new data.

Methods of Feature Selection

  1. Filter Methods: These methods apply statistical techniques to evaluate the importance of each feature independently of the model.

    a) Correlation Coefficient: Measures the correlation between each feature and the target variable.

    b) Chi-Square Test: Evaluates the independence of categorical features concerning the target variable.

    c) ANOVA: Assesses the relationship between continuous features and the target variable.

  2. Wrapper Methods: These methods evaluate the performance of a subset of features by training and testing a model. They are more computationally intensive but often provide better results.

    a) Recursive Feature Elimination (RFE): Iteratively removes the least important features and evaluates model performance.

    b) Forward/Backward Selection: This method adds or removes features one at a time based on their contribution to the model's performance.

  3. Embedded Methods: These methods perform feature selection as part of the model training process. They are model-specific and incorporate feature selection into the model's construction.

    a) LASSO (Least Absolute Shrinkage and Selection Operator): Adds a penalty term to the model that shrinks the coefficients of less important features to zero.

    b) Tree-Based Methods: Decision trees and random forests naturally perform feature selection based on feature importance scores. (A small sketch of all three approaches follows below.)
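
For illustration, here is a minimal scikit-learn sketch of one method from each family, run on synthetic placeholder data rather than the employee dataset (the KNIME feature selection nodes used later in this article are a separate mechanism):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 500 rows, 10 candidate features
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Filter method: score each feature against the target with an ANOVA F-test
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Filter keeps features:", np.where(filt.get_support())[0])

# Wrapper method: recursive feature elimination around a model
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print("RFE keeps features:", np.where(rfe.get_support())[0])

# Embedded method: an L1 (LASSO-style) penalty shrinks weak coefficients to zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("L1 keeps features:", np.where(l1.coef_[0] != 0)[0])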

  4. Predicting the model: The developed model predicts the values of the user-defined input of the test dataset. It stores the predicted value of the dependent variable for the given values of the input independent variables.

  5. Determining the model's accuracy: This is the most crucial step because it finally reports the accuracy of the created model. This step shows the final result of the algorithm used on the data. The primary focus is on increasing the model's accuracy with meaningful interpretations. The model's accuracy is determined by comparing the test dataset's predicted and original values; the difference between them shows the model’s inaccuracy. Different techniques such as the accuracy score, classification report, confusion matrix, receiver operating characteristic (ROC) curve, and area under the curve (AUC) are used for measuring the results of classification problems.

    a) Accuracy score: The accuracy is determined using the two inputs: the dependent variable's predicted values and the dependent variable's original values from the test data. The accuracy score is calculated between 0% and 100%.

    b) Confusion matrix: This is used to determine to what extent the predictions match the actual outcomes. A confusion matrix is a table often used to describe the performance of a classification model on a set of test data for which the actual values are known. To understand the confusion matrix, let us assume the results of a binary classifier. The two possible predicted classes are “yes” and “no”. If we were predicting employee attrition, “yes” would mean an employee has left the company, and “no” would mean they have not. Suppose the model made 100 predictions (100 employees being analyzed for leaving the company). The model predicted “yes” 40 times and “no” 60 times. However, 45 people actually left the company, and 55 did not. The confusion matrix entries are as follows:

    • True Positives (TP): These are cases in which the model predicted “yes,” and the employees have left the company (TP = 30).

    • True Negatives (TN): These are cases in which the model predicted “no,” and the employees have not left the company (TN = 45).

    • False Positives (FP) (“Type I error”): We predicted “yes”, but they have not left the company (FP = 10).

    • False Negatives (FN) (“Type II error”): We predicted “no”, but they left the company (FN = 15).

    • Accuracy: Accuracy is determined by (TP+TN)/Total = (30+45)/100 = 0.75 (75%)

Two other terms frequently used to evaluate a model are Sensitivity and Specificity.

  • Sensitivity (True Positive Rate): % of correctly predicted “yes” is determined by TP/(TP+FN) = 30/(30+15) = 0.67 (67%)

  • Specificity (True Negative Rate): % of correctly predicted “no” is determined by TN/(TN+FP) = 45/(45+10) = 0.82 (82%)

c) Classification report: The report displays precision, recall, sensitivity, specificity, F1-score (F-measure), and Cohen’s kappa. Precision is the ratio of True Positives to the sum of True and False Positives, i.e., the percentage of predicted positives that are correct. Recall is the ratio of True Positives to the sum of True Positives and False Negatives, i.e., the percentage of actual positives that were classified correctly. The F1 score is the harmonic mean of precision and recall; the highest score is 1.0, and the lowest is 0.0. The F1 score is recommended when comparing models and classifiers because it accounts for both precision and recall in its computation; note that on imbalanced data it is typically lower than the accuracy measure. Cohen's Kappa is a statistic that measures inter-rater agreement for categorical items; it accounts for the possibility of the agreement occurring by chance.

d) ROC curve and AUC: The ROC curve is a commonly used graph that summarizes the performance of a classifier over all possible discrimination thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the classifier's decision threshold is varied, and it is usually a good way to summarize the quality of a classifier: the higher the curve rises above the diagonal baseline, the better the predictions. The model's overall performance is summarized by the area under the ROC curve, stored as the AUC. The highest value of AUC is 1, while 0.5 corresponds to the 45-degree random baseline. If a model's AUC is less than 0.5, doing the exact opposite of the model's recommendation yields an AUC above 0.5.
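
To make these numbers concrete, here is a small illustrative sketch that rebuilds the 100-employee worked example from the TP/TN/FP/FN counts above and recomputes the same scores with scikit-learn:

import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix)

# Rebuild the worked example: 45 employees actually left (1), 55 stayed (0).
# Of the 45 leavers, 30 were predicted "yes" (TP) and 15 "no" (FN);
# of the 55 stayers, 10 were predicted "yes" (FP) and 45 "no" (TN).
y_true = np.array([1] * 45 + [0] * 55)
y_pred = np.array([1] * 30 + [0] * 15 + [1] * 10 + [0] * 45)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                    # 30 45 10 15
print(accuracy_score(y_true, y_pred))    # 0.75
print(tp / (tp + fn))                    # sensitivity: 0.67
print(tn / (tn + fp))                    # specificity: 0.82
print(cohen_kappa_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
# An ROC curve/AUC would additionally require predicted probabilities
# rather than hard 0/1 labels.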

  6. Creating a better model: The main objective of any analyst is to create a better model. Different approaches, such as hyper-parameter tuning and feature engineering, are adopted to achieve this.
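
As promised in the model development step, here is a minimal sketch of a reproducible, seeded train/test split, shown with pandas and scikit-learn rather than KNIME's Partitioning node (the file and column names follow the dataset introduced below):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("EmployeeAttrition.csv")
X = df.drop(columns=["LeftCompany"])   # independent variables
y = df["LeftCompany"]                  # dependent (target) variable

# random_state fixes the seed, so every run yields the identical split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=30000)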

Employee Attrition Data

The EmployeeAttrition.csv file helps us understand KNIME usage. This dataset can be downloaded from the “Data” folder on the KNIME Community Hub.

The Predicting Employee Attrition – Supervised Classification Machine Learning workflow can be downloaded from the KNIME Community Hub.

After downloading the CSV file from the Data folder, you are ready.

Reading the CSV-based dataset

The first step in our analysis will be to load the data for our exploratory analyses. We will do this first step using the CSV Reader node before we persist our analysis in a KNIME table.

The KNIME table is created by loading the EmployeeAttrition.csv dataset. The loaded table shows that the employee dataset has 14999 observations and 10 columns.

Data Exploration

The Statistics node primarily determines the descriptive summary statistics of the columns in the dataset. This node calculates statistical measures such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values, and row count across all numeric columns. It counts all nominal values together with their occurrences. The node provides the following three output tables:

a) Statistics Table: All statistic moments for all numeric columns,

b) Nominal Histogram Table: Nominal values for all selected categorical columns and

c) Occurrences Table: The most frequent/infrequent values from the categorical columns (Top/bottom)

The variable LeftCompany indicates whether an employee has left the organization (employee attrition: yes – left the company, no – not left the company), and the rest of the columns are independent variables. We can observe from the statistics table output that an employee's SatisfactionLevel and LastEvaluation are measured between 0 and 1. The average satisfaction level is 0.613, and the average last evaluation score is 0.716. The variables WorkAccident, LeftCompany, and PromotionLast5years have binary values (yes and no). The highest number of projects an employee has undertaken is 7, and the maximum time spent in the organization by an employee is 10 years. The average time spent by an employee is 3.5 years, and the average monthly hours spent by an employee is 201 hours. Salary is represented as low, medium, or high. Similarly, the Department column represents the department to which the employee belongs. The number of employees who left the company is 3571 (23.8%), while 11428 (76.2%) remain with the organization.
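
For readers working outside KNIME, roughly the same descriptive summary can be reproduced with pandas (a sketch, assuming the column names described above):

import pandas as pd

df = pd.read_csv("EmployeeAttrition.csv")
print(df.shape)                                         # (14999, 10)
print(df.describe())                                    # numeric summaries
print(df["LeftCompany"].value_counts(normalize=True))   # attrition share
print(df["Department"].value_counts())                  # nominal occurrences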

Data Visualization

Displaying Department-wise Employees Count

We will use the Bar Chart node to generate a bar plot that displays the department-wise employee count.

The chart shows that the Sales department had the most employees, followed by the technical and support departments. The Accounting, HR, and Management departments had the fewest employees.

Displaying Attrition by Department

Say we wanted to count the attrition aggregated across departments. We can use the Pivot node to perform this aggregation using column settings: group by LeftCompany, pivot by Department, and count by Salary. Subsequently, we will use the Bar Chart node to display the attrition count for each department.
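
The aggregation performed by the Pivot node can be sketched in pandas with a crosstab; the normalized variant also yields the department-wise attrition rates used in the following section:

import pandas as pd

df = pd.read_csv("EmployeeAttrition.csv")

# Attrition count per department (what the Pivot node produces)
counts = pd.crosstab(df["Department"], df["LeftCompany"])
print(counts)

# Attrition *rate* per department
rates = pd.crosstab(df["Department"], df["LeftCompany"], normalize="index")
print(rates["Yes"].sort_values(ascending=False))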

The chart shows that people have left from all the departments. However, proportionately fewer people have left the Sales department, while relatively more have left the technical, IT, product management, and marketing departments.

Displaying Attrition Rate by Department

The chart shows that the HR department has the highest attrition rate, followed by the accounting department. However, the attrition rate is lower in the research and development and management departments.

Visualizing Employee Details

We will use the Python View node to construct a count plot displaying the number of employees in the LeftCompany, Salary, TimeSpendCompany, PromotionLast5years, NumberProject, and WorkAccident columns.

import knime.scripting.io as knio

# Visualization of employee details
# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the input table once
df = knio.input_tables[0].to_pandas()

# Plotting
fig = plt.figure(figsize=(12, 10))

# Count plot displaying the number of employees who left/stayed
plt.subplot(231)
sns.countplot(x='LeftCompany', data=df, hue='LeftCompany', legend=False, palette="viridis")
plt.title("Left Company")

# Count plot displaying the number of employees per salary band
plt.subplot(232)
sns.countplot(x='Salary', data=df, hue='Salary', legend=False, palette="mako")
plt.title("Salary-wise Employees")

# Count plot by number of years spent in the organization
plt.subplot(233)
sns.countplot(x='TimeSpendCompany', data=df, hue='TimeSpendCompany', legend=False, palette="vlag")
plt.title("Time Spent")

# Count plot displaying the number of employees by promotion status
plt.subplot(234)
sns.countplot(x='PromotionLast5years', data=df, hue='PromotionLast5years', legend=False, palette="rainbow")
plt.title("Employees Promotions")

# Count plot displaying the number of employees by number of projects
plt.subplot(235)
sns.countplot(x='NumberProject', data=df, hue='NumberProject', legend=False, palette="terrain")
plt.title("Employees by Number of Projects")

# Count plot displaying the number of employees by work accident status
plt.subplot(236)
sns.countplot(x='WorkAccident', data=df, hue='WorkAccident', legend=False, palette="prism")
plt.title("Work Accident")

# Prevent subplot titles and axes from overlapping
plt.tight_layout()

# Assign the figure to the node's output view
knio.output_view = knio.view(fig)

We can observe from the chart that LeftCompany, PromotionLast5years, and WorkAccident are binary variables. Nearly 23.8% of the employees have left the organization, very few employees have been promoted in the last 5 years, and workplace accidents are rare. Most employees belong to the low-salary group, the largest tenure group has spent 3 years in the organization, and the greatest number of employees have three or four projects.

Determining Correlation Between Variables

We will use the Linear Correlation node to generate a correlation matrix heatmap based on a selected set of columns.

The square table shows the pair-wise correlation values of all columns. The color range varies from dark red (strong negative correlation) to dark blue (strong positive correlation). If a correlation value for a pair of columns is unavailable, the corresponding cell contains a missing value (shown as a cross in the color view). Hovering the mouse over each box in the views window displays the corresponding pair-wise correlation value. It is clear from the chart that LastEvaluation, NumberProject, and AverageMonthlyHours correlate well with one another. This is plausible because good performers may be assigned more projects to work on. Surprisingly, a negative correlation exists between satisfaction level and the number of projects.
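
The numeric part of the Linear Correlation node's output can be approximated with pandas and seaborn (a sketch; the node additionally handles nominal columns, which DataFrame.corr() does not):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("EmployeeAttrition.csv")

# Pair-wise Pearson correlation over the numeric columns only
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdBu", vmin=-1, vmax=1)
plt.tight_layout()
plt.show()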

Association Between Attrition and Categorical Variables

We will use the Python View node to generate multiple count plots between the LeftCompany column and the NumberProject, Salary, WorkAccident, and PromotionLast5years categorical columns.

import knime.scripting.io as knio

# Count plots between employee attrition and categorical variables
# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the input table once
df = knio.input_tables[0].to_pandas()

# Plotting
fig = plt.figure(figsize=(15, 8))
plt.suptitle('Association between Attrition and Categorical Variables',
             fontsize=20, fontweight='bold')

# Count plot of employees by number of projects, split by attrition
plt.subplot(221)
sns.countplot(x='NumberProject', data=df, hue='LeftCompany', palette="Set1_r", legend='auto')
plt.xlabel('Number of Projects')
plt.ylabel('Number of Employees')

# Count plot of employees by salary band, split by attrition
plt.subplot(222)
sns.countplot(x='Salary', data=df, hue='LeftCompany', palette="Set2_r", legend='auto')
plt.xlabel('Salary')
plt.ylabel('Number of Employees')

# Count plot of employees by work accident status, split by attrition
plt.subplot(223)
sns.countplot(x='WorkAccident', data=df, hue='LeftCompany', palette="Set3_r", legend='auto')
plt.xlabel('Work Accident')
plt.ylabel('Number of Employees')

# Count plot of employees by promotion in the last 5 years, split by attrition
plt.subplot(224)
sns.countplot(x='PromotionLast5years', data=df, hue='LeftCompany', palette="Set1", legend='auto')
plt.xlabel('Promotion in Last 5 years')
plt.ylabel('Number of Employees')

# Assign the figure to the node's output view
knio.output_view = knio.view(fig)

From the chart of attrition by number of projects, we can see that employees with three projects rarely left the organization, while employees with 2, 4, 5, and 6 projects left in larger numbers; there is thus no simple monotonic association between the number of projects and attrition. The salary chart shows that more employees in the low and medium salary bands left the organization compared to those offered high salaries. Many employees left the organization regardless of whether they had a work accident, so work accidents are not an essential factor in leaving. Nearly 4000 employees who left had received no promotion and may have been dissatisfied with the promotion process. From the preceding four categorical views, we can reasonably assume that promotion and salary may be reasons why employees leave the organization.

Association Between Attrition and Continuous Variables

We will use the Python View node to generate kernel density plots of the LastEvaluation, SatisfactionLevel, and AverageMonthlyHours continuous columns, split by the LeftCompany column.

import knime.scripting.io as knio

# Kernel density plots between employee attrition and continuous variables
# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the input table once
df = knio.input_tables[0].to_pandas()

# Split the data by attrition status
left_no = df[df["LeftCompany"] == 'No']
left_yes = df[df["LeftCompany"] == 'Yes']

fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(13, 9))

# Kernel density plot for employee attrition and last evaluation
sns.kdeplot(data=left_no, x="LastEvaluation", color='#861B04', fill=True, ax=ax1)
sns.kdeplot(data=left_yes, x="LastEvaluation", color='#F86C4D', fill=True, ax=ax1)
ax1.set_title('Last Evaluation & Left Company')
ax1.legend(['No', 'Yes'])

# Kernel density plot for employee attrition and satisfaction level
sns.kdeplot(data=left_no, x="SatisfactionLevel", color='#4DB5F8', fill=True, ax=ax2)
sns.kdeplot(data=left_yes, x="SatisfactionLevel", color='#C4FC05', fill=True, ax=ax2)
ax2.set_title('Satisfaction Level & Left Company')
ax2.legend(['No', 'Yes'])

# Kernel density plot for employee attrition and average monthly hours
sns.kdeplot(data=left_no, x="AverageMonthlyHours", color='#05FC93', fill=True, ax=ax3)
sns.kdeplot(data=left_yes, x="AverageMonthlyHours", color='#FC0593', fill=True, ax=ax3)
ax3.set_title('Average Monthly Hours & Left Company')
ax3.legend(['No', 'Yes'])

plt.tight_layout()

# Assign the figure to the node's output view
knio.output_view = knio.view(fig)

The chart shows that people who left the organization had evaluation scores below 0.6 or above 0.8; that is, either low performers or high performers left. Employees with average evaluations largely stayed in the organization. The pattern is similar for monthly hours: employees with low or high average monthly hours left the organization, while those with average hours stayed. Satisfaction shows a clearer direction: employees with lower satisfaction levels left the organization in greater numbers.

Scatter Plot Matrix for Continuous Columns by Attrition

We will first use the Color Manager node to assign specific colors for the categories (Yes and No) in the LeftCompany column. Subsequently, we will use the Scatter Plot Matrix node to create a pair-wise scatter plot matrix of SatisfactionLevel, LastEvaluation, and AverageMonthlyHours columns with LeftCompany (Yes and No) as the color dimension for employees.
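
An equivalent of the Color Manager plus Scatter Plot Matrix combination can be sketched with seaborn's pairplot, where the hue argument plays the role of the Color Manager:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("EmployeeAttrition.csv")
cols = ["SatisfactionLevel", "LastEvaluation", "AverageMonthlyHours"]

# One color per LeftCompany value in every pair-wise scatter panel
sns.pairplot(df[cols + ["LeftCompany"]], hue="LeftCompany",
             plot_kws={"s": 8, "alpha": 0.4})
plt.show()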

The chart shows that employees with a high last evaluation and high satisfaction, a high last evaluation and low satisfaction, or medium values of both tend to leave. However, employees with a low last evaluation, irrespective of satisfaction level, do not leave, and neither do employees with the very highest satisfaction and last evaluation scores; these employees might be receiving higher salaries. Similarly, employees with more monthly hours tend to leave, while employees with fewer monthly hours, irrespective of LastEvaluation and SatisfactionLevel, do not, so AverageMonthlyHours is a significant reason behind leaving the organization. However, employees with higher satisfaction levels do not leave, irrespective of the monthly hours.

Dummy Encoding

As part of feature engineering, it is essential to convert each category into a separate column represented by binary values (0, 1). We will use the One to Many node to transform all possible category values in the selected columns into new columns. It is worth noting that the number of new columns created for a particular column equals the number of categories in that column.

The One to Many node creates new columns according to the number of categories of each variable. Ten new columns have been created since Department has ten categories. Similarly, three new columns are created for Salary, and two each for WorkAccident and PromotionLast5years. The result clearly shows that there are 14999 observations with 27 columns. Among these, 17 new columns have binary integer values.
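
The One to Many transformation corresponds to one-hot (dummy) encoding. In pandas it can be sketched as follows (illustrative; note that unlike the KNIME node, get_dummies drops the original categorical columns):

import pandas as pd

df = pd.read_csv("EmployeeAttrition.csv")

# One new binary column per category value, as the One to Many node creates
encoded = pd.get_dummies(
    df,
    columns=["Department", "Salary", "WorkAccident", "PromotionLast5years"],
    dtype=int)
print(encoded.shape)   # 14999 rows; 17 dummy columns replace the 4 originals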

In classification prediction, one category for each categorical variable is considered a reference, and others are considered dummy variables. For this process, the following dummy and reference variables are created for each attribute:

  • For department, “management_Department” is the reference variable.

  • For salary, “low_Salary” is a reference variable.

  • For work accident, “No_WorkAccident” is a reference variable.

  • For promotion in the last five years, “Yes_PromotionLast5years” is the reference variable.

Dropping Reference Variables

It is essential to retain only the required columns for the analysis. Hence, the reference variables and the original categorical variables are removed using the Column Filter node before creating the classification model. LeftCompany is the dependent variable.

After dropping the unwanted columns, the table has 14999 rows and 19 columns.

Data Partitioning

The complete dataset clearly shows many more employees who stayed than employees who left. As a result, we utilize stratified sampling on the LeftCompany column in the Partitioning node. The random seed option makes the node divide the dataset identically on every run; this is done intentionally so the results can be replicated. For this partitioning, the seed value is 30000. The dataset is split in the ratio of 70:30 for the training and test datasets, respectively.

The first partition training table has 10499 rows and 19 columns; the second partition test table has 4500 rows and 19 columns.

Predicting Employee Attrition Using Logistic Regression (All Attributes)

We will predict employee attrition using the logistic regression model with all the independent variables. The characteristics of logistic regression help solve many real-world problems, since most real-world data calls for classification. It is essential to mention that most of the assumptions required for linear regression need not be fulfilled before applying logistic regression. First, logistic regression does not require a linear relationship between the dependent and independent variables. Second, the error terms do not need to be normally distributed. Third, homoscedasticity is not required. Logistic regression does, however, require the dependent variable to be categorical (binary, in the case of standard logistic regression).

We will run the logistic regression model on the training dataset using the Logistic Regression Learner node, with LeftCompany as the target column and ‘No’ as the reference category. We will include all the independent variables that we had filtered earlier. Subsequently, with the Logistic Regression Predictor node, we will predict the response on the test dataset. The resultant output contains three new columns, one containing the prediction for each row and two more containing the predicted probabilities for the target column's two categories (yes and no). They represent the probability that a row in the input data falls into a specific category.
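
Outside KNIME, the Learner/Predictor pair corresponds to fitting a model and asking it for class probabilities; a minimal scikit-learn sketch, reusing the split variables from the earlier sketch (placeholder names):

from sklearn.linear_model import LogisticRegression

# X_train, X_test, y_train, y_test come from the stratified 70:30 split
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

pred = model.predict(X_test)          # the prediction column
proba = model.predict_proba(X_test)   # per-row probabilities; columns follow
print(model.classes_, proba[:5])      # the class order, e.g. ['No' 'Yes']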

We will use the Scorer node to compare the actual (LeftCompany) and predicted (LeftCompPredLogistic) columns by their attribute value pairs and show the confusion matrix. Additionally, the node provides several accuracy statistics such as True-Positives, False-Positives, True-Negatives, False-Negatives, Recall, Precision, Sensitivity, Specificity, Cohen’s Kappa, and F-measures, as well as the overall accuracy.

We will also use the ROC Curve node to visualize and find the Area Under the Curve (AUC) for the model based on the predicted probabilities column (LeftCompany=’Yes’).

The logistic regression model's accuracy for the test data is 0.797. The confusion matrix shows that True Positives are 404 and True Negatives are 3181. Similarly, False Positives are 248 and False Negatives are 667. Hence, accuracy is 0.797, and misclassification is 0.203. This further means that Sensitivity is 0.377 and Specificity is 0.928.

The results of the scorer report can be explained as follows:

Precision is the ratio of True Positives to the sum of True and False Positives, i.e., the percentage of predicted positives that are correct. Our results show that the precision for ‘Yes’ is 0.62 and for ‘No’ is 0.827: predictions of ‘Yes’ are correct 62% of the time, and predictions of ‘No’ are correct 82.7% of the time. Recall is the ratio of True Positives to the sum of True Positives and False Negatives, i.e., the percentage of records in a class that are classified correctly. Our results show recall values of 0.377 and 0.928 for ‘Yes’ and ‘No’, respectively; the model captures only 37.7% of the employees who actually left. The F1 score is the harmonic mean of precision and recall; the highest score is 1.0, and the lowest is 0.0. Our test data's F1-scores are 0.469 and 0.874 for ‘Yes’ and ‘No’, respectively. Cohen’s Kappa is 0.352 (fair agreement), which is low in our case. From the ROC curve, we can observe an AUC of 0.826, which is good.

While all these metrics are helpful, each serves a specific purpose in fine-tuning the model and ensuring it meets the desired performance criteria for its particular applications.

Let's explore how these metrics can be used effectively in predicting employee attrition:

Accuracy

  • Utility: Provides a general sense of the model’s overall performance by showing the proportion of correct predictions.

  • Context: Useful when the cost of false positives and false negatives is similar. In attrition prediction, accuracy shows how well the model is doing overall, but it might not capture the nuances of the data if there’s a class imbalance.

Precision

  • Utility: Focuses on the accuracy of positive predictions (predicting an employee will leave).

  • Context: Important when the cost of a false positive is high (e.g., unnecessary interventions). In predicting attrition, high precision means when the model predicts an employee will leave, it’s likely correct.

Recall

  • Utility: Measures the model’s ability to capture all actual positives.

  • Context: Crucial when the cost of missing a true positive is high (e.g., failing to identify an at-risk employee). High recall ensures that most employees at risk of leaving are identified.

F1-Score (F-Measure)

  • Utility: Balances precision and recall by providing a single metric. It’s the harmonic mean of precision and recall.

  • Context: Useful when there’s a need to balance the trade-offs between precision and recall in attrition prediction, ensuring neither false positives nor false negatives are disproportionately high.

Cohen's Kappa

  • Utility: Adjusts for agreement occurring by chance, measuring how much better the model is compared to random guessing.

  • Context: Useful for evaluating the model’s performance beyond what would be expected by chance, especially in imbalanced datasets. It provides a more nuanced understanding of the model’s predictive power.

False Positive Rate (FPR)

  • Utility: Measures the proportion of negative instances incorrectly classified as positive (employees predicted to leave but stay).

  • Context: Important in scenarios where the cost of false positives is high. In attrition prediction, a high FPR might lead to unnecessary interventions for employees not at risk.

In predicting employee attrition:

  • Accuracy gives an overall idea but may not be sufficient if the classes are imbalanced.

  • Precision ensures that identified at-risk employees are likely to leave.

  • Recall ensures that most at-risk employees are flagged.

  • F1-Score balances the trade-offs between precision and recall.

  • Cohen’s Kappa provides a measure that accounts for chance agreement.

  • FPR helps minimize unnecessary interventions for employees who are not at risk.

Using these metrics together helps you comprehensively evaluate and fine-tune the model, ensuring it’s effective and reliable in predicting employee attrition.

Feature Selection

We have now established metrics for a classification model that predicts whether an employee will leave the company using all of the features (attributes/independent variables) available in the dataset. The subsequent step is to analyze whether we can achieve comparable or better prediction accuracy using fewer features. We shall use the feature selection nodes in KNIME to do so.

We will apply the ‘Forward Feature Selection’ strategy in the Feature Selection Loop Start node. This iterative approach starts with no features selected; each iteration adds the feature that improves the model the most. The Feature Selection Loop End node closes the loop and collects the results of each feature subset according to the selection strategy. In our case, we use the ‘Accuracy’ score variable because our objective is to maximize the model's accuracy; alternatively, we could use other score metrics such as ‘Cohen’s Kappa’ or ‘Error’, depending on the performance objective. Subsequently, we use the Feature Selection Filter node to choose the reduced feature set we want to take forward to build the classification model. In our case, we select the subset of six features with an accuracy of 0.824: it is the subset with the fewest features among the options sharing the highest accuracy of 0.824, so we achieve the highest accuracy possible with the fewest features. We check the ‘include static columns’ option to retain the target variable LeftCompany in the dataset.

Our chosen features subset has the following six independent variables: SatisfactionLevel, NumberProject, marketing_Department, medium_Salary, high_Salary, and Yes_WorkAccident, along with the target variable LeftCompany.
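
KNIME's forward selection loop can be mimicked in scikit-learn with SequentialFeatureSelector, which likewise starts from an empty set and adds whichever feature improves the score the most (a sketch using the placeholder split variables from earlier):

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=6,    # the subset size chosen above
    direction="forward",
    scoring="accuracy",        # the score variable used in the KNIME loop
    cv=5)
sfs.fit(X_train, y_train)
print(X_train.columns[sfs.get_support()])   # the selected feature names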

Predicting Employee Attrition Using the Subset of Attributes

We will now predict employee attrition on the reduced subset of the attributes dataset by applying different classification machine learning algorithms. After that, we will collate all the accuracy statistics derived by using these models to the test data to compare the performance of various models and choose the best model to serve the desired objective.

Predicting Employee Attrition Using Logistic Regression (Selected Attributes Subset)

We will run the logistic regression model on the training dataset using the Logistic Regression Learner node, with LeftCompany as the target column and ‘No’ as the reference category. We will include only the six independent variables that we had filtered earlier. Subsequently, with the Logistic Regression Predictor node, we will predict the response on the test dataset. The resultant output contains three new columns, one containing the prediction for each row and two more containing the predicted probabilities for the target column's two categories (yes and no). They represent the probability that a row in the input data falls into a specific category.

We will use the Scorer node to compare the actual (LeftCompany) and predicted columns by their attribute value pairs and show the confusion matrix. Additionally, the node provides several accuracy statistics such as True-Positives, False-Positives, True-Negatives, False-Negatives, Recall, Precision, Sensitivity, Specificity, Cohen’s Kappa, and F-measures, as well as the overall accuracy. We will also use the ROC Curve node to visualize and find the Area Under the Curve (AUC) for the model based on the predicted probabilities column (LeftCompany=’Yes’).

Predicting Employee Attrition Using Decision Tree (Selected Attributes Subset)

Decision trees are primarily used in machine learning and data mining applications. They are graphs that represent choices and their outcomes in a tree structure. The nodes in the graph represent an event or choice, and the edges represent the decision rules or conditions. The decision tree classifier is a supervised learning algorithm that can be used for both classification (categorical dependent variable) and regression (continuous dependent variable). However, the factors that help us to decide which algorithm to use are discussed below:

Situations to Use Decision Tree Model

  1. A decision tree is preferred in cases of high nonlinearity and a complex relationship between dependent and independent variables.

  2. A decision tree model is a simple and easy-to-understand graphical representation, even for people from nonanalytical backgrounds. No statistical knowledge is required to understand and interpret the results.

  3. A decision tree is one of the fastest ways to identify the most significant variables and the relation between two or more variables. Decision trees can help us create new variables/features with better power to predict the target variable.

  4. The decision tree model is a nonparametric method unaffected by outliers and missing values. Hence, we do not need to check assumptions. Less data cleaning is required, and imputation is not needed.

Situations When the Decision Tree Model Should Not Be Used

  1. The linear regression algorithm should be adopted if a linear model well approximates the relationship between the dependent and independent variables.

  2. Overfitting is one of the most significant practical challenges with decision tree models. Setting constraints on model parameters and pruning address this problem.

  3. A decision tree should not be adopted while predicting a dependent continuous variable because while working with continuous numerical variables, it generally loses information when categorizing numerical variables into different categories.

Decision trees are typically drawn upside down such that the terminal node (leaf) is at the bottom and the root node is at the top. The root node represents the entire population or sample, divided into two or more homogeneous sets. Splitting is a process of dividing a node into two or more sub-nodes. When a sub-node splits into further sub-nodes, it is called a decision node. A node divided into sub-nodes is called the parent node of the sub-nodes; sub-nodes are the children of the parent node. The decision to make strategic splits heavily affects a tree’s accuracy. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. It then selects the split that results in the most homogeneous sub-nodes.

We will run the decision tree model on the training dataset using the Decision Tree Learner node, with LeftCompany as the target column. We will include only the six independent variables that we had filtered earlier. Subsequently, we will predict the response on the test dataset with the Decision Tree Predictor node. The Scorer and ROC Curve nodes are used to calculate the performance statistics for this model.
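
A rough scikit-learn equivalent of the Decision Tree Learner/Predictor pair (X_train_sel and X_test_sel are hypothetical frames holding only the six selected features; max_depth stands in for the pruning constraints discussed above):

from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train_sel, y_train)   # six selected features only
print(export_text(tree, feature_names=list(X_train_sel.columns)))  # the rules
print(tree.score(X_test_sel, y_test))   # accuracy on the test set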

Predicting Employee Attrition Using Random Forest (Selected Attributes Subset)

Random forest is an extension of bagging. This algorithm also takes a random selection of features rather than using all the features to grow trees along with a random subset of data, as in the bagging algorithm. Decision trees, prone to overfit, have been transformed into random forests by training many trees over various subsamples of the data (in terms of observations and predictors used to train them). Hence, a lot of random trees are generated. Since there are many random trees, it is called a random forest.

We know that error occurs due to two main reasons: bias and variance. A too-complex model has a low bias but a significant variance, while a too-simple model has a low variance but a considerable bias. Hence, we need different remedies for the two problems: a variance reduction algorithm for a complex model and a bias reduction algorithm for a simple model. Boosting algorithms help reduce bias to a great extent in simple models, while random forest reduces the variance of complex models with low bias. The trees are large in number and built on random subsamples, and the additional random selection of variables makes them even more independent of one another, so random forest performs better than the bagging algorithm.

The parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed. The main advantages of decision trees are that they can be easily visualized and understood by nonexperts and that the algorithm is entirely invariant to data scaling. However, the main drawback of decision trees is that they tend to overfit even with pre-pruning, giving poor generalization performance. Hence, most applications use a random forest instead of a single decision tree.

Random forest is powerful, often works well without heavy tuning of its parameters, and does not require data scaling. Random forest shares most of the benefits of decision trees; however, a single decision tree is preferable when we need a compact representation of the decision-making process, since it is hardly feasible to interpret hundreds of trees in detail, and the trees in random forests tend to be deeper than a standalone decision tree. Thus, if a nonexpert needs a visual explanation of the predictions, a single decision tree might be better. Random forest works well on large datasets; training might be time-consuming but can be parallelized across multicore processors. Random forest performs poorly on high-dimensional, sparse data such as text.

Random forest is often regarded as a go-to solution for a wide range of data science problems. Like a decision tree, random forest can perform both regression and classification tasks. The method combines a group of weak models to form a robust model. In a random forest, we grow multiple trees instead of the single tree of the decision tree model. In a classification problem, each tree outputs a class, interpreted as that tree voting for the class, and the forest chooses the classification with the most votes across all the trees in the forest. In regression problems, the algorithm takes the average of the outputs of the different trees. Every observation is fed into every decision tree, and the most common outcome (the majority vote) for each observation is used as the final output; a new observation is classified the same way.

The major drawback of random forests is that, like decision trees, they are appropriate for classification problems but not equally good for regression, as they do not give precise continuous predictions: the regression cannot predict beyond the range of the training data, and the model sometimes overfits noisy datasets. However, because of their high predictive power, data analysts generally use random forests for prediction.

We will run the random forest model on the training dataset using the Random Forest Learner node, with LeftCompany as the target column. We will include only the six independent variables that we had filtered earlier. Subsequently, we will predict the response on the test dataset with the Random Forest Predictor node. The Scorer and ROC Curve nodes are used to calculate the performance statistics for this model.
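
The corresponding sketch for the Random Forest Learner/Predictor pair; feature_importances_ exposes the importance scores mentioned above (same hypothetical variables as before):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X_train_sel, y_train)   # many trees on random rows and features
print(rf.score(X_test_sel, y_test))

# Importance of each of the six selected features
for name, imp in zip(X_train_sel.columns, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")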

Predicting Employee Attrition Using Gradient Boosted Trees (Selected Attributes Subset)

Boosting is another ensemble technique for creating a collection of predictors, and gradient boosting is an extension of the boosting method. An ensemble of trees is built one by one, and the individual trees are summed sequentially. Each new tree learns from the trees before it, fitting a relatively simple model to the data; we fit consecutive trees, and at every step the goal is to reduce the errors of the previous trees. If an input is wrongly classified, its weight is increased so that the next step classifies it correctly. Combining the whole set at the end converts weak learners into a better-performing model. The method works as a chain of nested, iterative models: the new models are not independent parallel models, but each is built from all the previous small models by weighting, so the latest model gets boosted by the earlier ones. This reduces the bias of a large number of small models with low variance. The algorithm supports different loss functions and works well with interactions, but it is prone to overfitting and requires careful tuning of its hyper-parameters.

Unlike boosting, bagging focuses on reducing the high variance of learners by averaging many models fitted on bootstrapped data samples generated with replacement from the training data, thereby avoiding overfitting. Boosting instead is a sequential technique that works on the principle of an ensemble: it combines a set of weak learners and delivers improved prediction accuracy. At any instant t, the model outcomes are weighted based on the outcomes of the previous instant t-1: the outcomes predicted correctly are given a lower weight, and the ones misclassified are weighted higher. This description applies to classification problems, while a similar method is used for regression.

Another major difference between the two techniques is that in bagging, the various models generated are independent of each other and have equal weightage, whereas boosting is a sequential process in which each new model generated is added to improve the performance of the previous collection of models.

We will run the gradient-boosted trees model on the training dataset using the Gradient Boosted Trees Learner node, with LeftCompany as the target column. We will include only the six independent variables that we had filtered earlier. Subsequently, we will predict the response on the test dataset with the Gradient Boosted Trees Predictor node. The Scorer and ROC Curve nodes are used to calculate the performance statistics for this model.
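
And a sketch for the Gradient Boosted Trees pair; each shallow tree is fitted sequentially to the errors of the ensemble so far, with learning_rate controlling how strongly each new tree corrects the previous ones (hypothetical variables again):

from sklearn.ensemble import GradientBoostingClassifier

gbt = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
gbt.fit(X_train_sel, y_train)   # trees are added one by one
print(gbt.score(X_test_sel, y_test))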

Collating the Performance Metrics for Comparison

Now that we have developed multiple models, the next logical step is to choose the optimum classification model for predicting employee attrition based on the performance metrics. To do so, we will collate all the performance statistics from the various classification models and then conduct a comparative analysis.

import knime.scripting.io as knio

# Bar charts of each performance measure across the classification models
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib

# Set the backend to 'Agg' to ensure compatibility with non-GUI environments
matplotlib.use('Agg')

# Load the collated performance statistics
data = knio.input_tables[0].to_pandas()

# Determine, for each measure, the model with the highest value
highlight_data = data.loc[data.groupby("Measure Name")["Measure Value"].idxmax()]

# Create a FacetGrid with one panel per performance measure
g = sns.FacetGrid(data, col="Measure Name", col_wrap=3, height=4, aspect=1.5, sharey=False)

# Add the bar plots
g.map_dataframe(sns.barplot, x="Classification Model", y="Measure Value",
                hue="Classification Model", palette="muted", legend=False)

# Highlight the best bar in each panel; bars are matched to models via the
# x-tick labels rather than their positions
for ax, measure_name in zip(g.axes.flat, g.col_names):
    best_models = set(
        highlight_data.loc[highlight_data["Measure Name"] == measure_name,
                           "Classification Model"])
    labels = [t.get_text() for t in ax.get_xticklabels()]
    for p, label in zip(ax.patches, labels):
        if label in best_models:
            p.set_edgecolor('red')
            p.set_linewidth(2)

# Rotate x-axis labels for better readability
for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)
        label.set_ha('right')

# Show data labels on each of the bars to 3 decimal places
for ax in g.axes.flat:
    for p in ax.patches:
        if p.get_height() > 0:  # skip bars with height 0
            ax.text(p.get_x() + p.get_width() / 2., p.get_height(),
                    '{:.3f}'.format(p.get_height()),
                    fontsize=9, color='black', ha='center', va='bottom')

# Adjust layout to prevent label overlap and pad the panel titles
plt.tight_layout()
plt.subplots_adjust(top=0.9)

# Assign the current figure to the output view
knio.output_view = knio.view(plt.gcf())

Our comparative analysis indicates that we can effectively use a selected subset of attributes rather than the complete dataset for future predictions while maintaining model accuracy. Based on the above comparative table and graph, it is safe to conclude that the Decision Tree model strikes the optimum balance across the performance parameters for predicting employee attrition with a selected subset of attributes.

Best Practices for Using these Metrics Effectively

Here are some best practices for using Accuracy, Cohen's Kappa, and False Positive Rate effectively in your machine learning projects:

Model Accuracy

  • Combine with Other Metrics: Always use accuracy with other metrics like precision, recall, and F1-score, especially when dealing with imbalanced datasets.

  • Cross-Validation: Use cross-validation to get a more reliable estimate of your model's accuracy (a short sketch follows this list).

  • Monitor Over Time: Track accuracy to detect performance drift or data changes.
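
As mentioned in the cross-validation point, a quick sketch of obtaining a more reliable accuracy estimate (model, X, and y are placeholders for your estimator and full dataset):

from sklearn.model_selection import cross_val_score

# Five-fold cross-validated accuracy instead of a single train/test split
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())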

Cohen's Kappa

  • Interpret Correctly: Understand the range and meaning of Kappa values. Remember that values close to 1 indicate strong agreement, while values close to 0 indicate agreement similar to chance.

  • Class Distribution Check: Be mindful of class distributions, as highly imbalanced data can affect the interpretation of Kappa.

  • Compare with Other Metrics: Use Cohen's Kappa alongside other metrics to get a fuller picture of model performance.

False Positive Rate (FPR)

  • Threshold Tuning: Adjust the decision threshold to balance FPR with other metrics, such as False Negative Rate (FNR) and Precision.

  • Context Awareness: Consider the specific context of your application. In scenarios like medical diagnosis, prioritize minimizing FPR to avoid false alarms.

  • ROC Curve: Use the Receiver Operating Characteristic (ROC) curve to analyze the trade-off between True Positive Rate (TPR) and FPR at different thresholds.

General Best Practices

  • Understand Context: Tailor the choice of metrics to the specific application and the consequences of different types of errors.

  • Data Quality: Ensure high-quality data with accurate labels. Poor data quality can skew all metrics.

  • Continuous Monitoring: Implement continuous monitoring to detect any changes in model performance and take corrective actions promptly.

  • Transparency: Document and communicate the chosen metrics and their implications to stakeholders.

  • Iterative Evaluation: Regularly revisit and refine the metrics as the model evolves and the application context changes.

By following these best practices, you can make more informed decisions about model performance and ensure your machine-learning models are reliable and effective.

Summary

In conclusion, predicting employee attrition using supervised classification machine learning techniques offers organizations a powerful tool to understand and mitigate the factors leading to employee turnover. By leveraging data-driven insights, companies can identify at-risk employees and implement targeted interventions to enhance retention. The process involves data exploration, model development, feature selection, and accuracy assessment, ensuring the predictive model is robust and reliable. By focusing on key metrics such as precision, recall, and F1-score, organizations can fine-tune their models to balance the trade-offs between false positives and false negatives, ultimately leading to more effective human resource strategies. This approach helps retain valuable talent and contributes to maintaining organizational knowledge and reducing the costs associated with high attrition rates.

Written by Vijaykrishna

I’m a data science enthusiast who loves to build projects in KNIME and share valuable tips on this blog.