Identify Absenteeism Patterns - Cluster Analysis


Introduction

Absenteeism is a habitual pattern of absence from a duty or obligation. Employee absenteeism is expensive and incurs a significant loss to an organization, so it is essential to characterize employees according to their level of absenteeism. This can be done with cluster analysis techniques, which group employees based on similar characteristics. There are two main techniques for cluster analysis: k-means clustering and hierarchical clustering.

Unsupervised Machine Learning

Unsupervised machine learning algorithms are used when the output is unknown and no predefined labels are available to the learning algorithm. In unsupervised learning, the algorithm only has input data, and knowledge is extracted from these data. These algorithms create a new representation of the data that is easier to comprehend than the original, and they can make downstream algorithms faster, more memory-efficient, and more accurate. Standard unsupervised machine learning algorithms include association rule mining, dimensionality reduction algorithms, and clustering.

Clustering Techniques

Clustering is the task of partitioning data into groups, called clusters, whose members are similar in some way. A cluster is a collection of observations that are similar among themselves but different from the observations belonging to other clusters. K-means clustering and hierarchical clustering are the most commonly used clustering algorithms. Centroid-based clustering, such as k-means, tries to find cluster centers that represent regions of the data.

Clustering deals with finding structure in a collection of unlabeled data. Data points inside a cluster are homogeneous, while points belonging to different clusters are heterogeneous. Choosing the correct number of clusters is essential: with too many clusters you start modeling noise, and with too few you fail to capture meaningful groups of observations. Generally, two forms of clustering, k-means and hierarchical clustering, are used for grouping employees or observations.

Employee Absenteeism Data

We will use the EmployeeAbsenteeism.csv file to demonstrate the analysis in KNIME. This dataset can be downloaded from the “Data” folder on the KNIME Community Hub.

The Identify Absenteeism Patterns – Cluster Analysis workflow can be downloaded from the KNIME Community Hub.

After downloading the CSV file from the Data folder, you are ready to build the workflow.

Reading the CSV-based dataset

The first step in our analysis is to load the data for exploratory analysis. We do this with the CSV Reader node, which reads the file into a KNIME table.

The KNIME table is created by loading the EmployeeAbsenteeism.csv dataset. The resulting table shows that the employee dataset has 740 observations and 21 columns.

Converting Continuous Column to Categorical Using Binning

Since we will later evaluate the clusters against discrete absenteeism levels, it is essential to convert the continuous AbsenteeismTimeInHours column into a categorical one. We will use the Auto-Binner node to split the AbsenteeismTimeInHours column into three bins of equal frequency, and then the Rule Engine node to assign meaningful category values (low, avg, high) to the resulting Absenteeism column for easy analysis and interpretation.
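For readers who prefer code, here is a minimal pandas sketch of what these two nodes accomplish together. The column and category names follow the workflow above; pd.qcut performs the equal-frequency binning, under the assumption that the three quantile edges of AbsenteeismTimeInHours are distinct.

import pandas as pd

# Equal-frequency binning of the continuous column into three bins,
# labeled with the same categories the Rule Engine assigns.
# Assumes the three quantile edges are distinct; otherwise qcut raises an error.
df = pd.read_csv("EmployeeAbsenteeism.csv")
df["Absenteeism"] = pd.qcut(df["AbsenteeismTimeInHours"], q=3, labels=["low", "avg", "high"])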

Data Visualization

We will use the Python View node to generate multiple count plots between the Absenteeism column and the SocialDrinker, SocialSmoker, Son, Seasons, Pet, and DayOfTheWeek categorical columns.

import knime.scripting.io as knio

# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the input table once and reuse it for every plot
df = knio.input_tables[0].to_pandas()

# Count plots between employee absenteeism and the categorical variables
fig = plt.figure(figsize=(15, 8))
plt.suptitle('Association between Absenteeism and Categorical Variables',
             fontsize=20, fontweight='bold')

# Count plot displaying the number of employees by social drinker status
plt.subplot(231)
sns.countplot(x='SocialDrinker', data=df, hue='Absenteeism', palette="Set1_r", legend='auto')
plt.xlabel('Social Drinker')
plt.ylabel('Number of employees')

# Count plot displaying the number of employees by social smoker status
plt.subplot(232)
sns.countplot(x='SocialSmoker', data=df, hue='Absenteeism', palette="Set2_r", legend='auto')
plt.xlabel('Social Smoker')
plt.ylabel('Number of employees')

# Count plot displaying the number of employees by number of children
plt.subplot(233)
sns.countplot(x='Son', data=df, hue='Absenteeism', palette="Set3_r", legend='auto')
plt.xlabel('Son')
plt.ylabel('Number of employees')

# Count plot displaying the number of employees by season
plt.subplot(234)
sns.countplot(x='Seasons', data=df, hue='Absenteeism', palette="Set1", legend='auto')
plt.xlabel('Seasons')
plt.ylabel('Number of employees')

# Count plot displaying the number of employees by number of pets
plt.subplot(235)
sns.countplot(x='Pet', data=df, hue='Absenteeism', palette="Set2", legend='auto')
plt.xlabel('Pet')
plt.ylabel('Number of employees')

# Count plot displaying the number of employees by day of the week
plt.subplot(236)
sns.countplot(x='DayOfTheWeek', data=df, hue='Absenteeism', palette="Set3", legend='auto')
plt.xlabel('Day of the week')
plt.ylabel('Number of employees')

# Assign the figure to the output view of the Python View node
knio.output_view = knio.view(fig)

The chart clearly shows that employees most often fall into the average absenteeism category. A larger number of employees have no pets. The day and season are not crucial for absenteeism, because the counts are roughly uniform across days of the week and seasons. It can be observed that most of the absent employees do not smoke, so smoking is not the reason for absenteeism. Among the absentees, the number of employees who drink is about the same as those who do not drink. It is also clear that most employees have no children or one child.

We will use the Python View node to generate kernel density plots between the Absenteeism column and the BodyMassIndex, Age, and TransportationExpense continuous columns. Alternatively, we can use the Density Plot node to create an interactive plot with the Absenteeism column as the condition column and BodyMassIndex, Age, and TransportationExpense as the dimension columns.

import knime.scripting.io as knio

# Importing libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the input table
df = knio.input_tables[0].to_pandas()

# Association between employee absenteeism and continuous variables.
# The category values must match those assigned by the Rule Engine (low, avg, high).
absent_low = df[df["Absenteeism"] == 'low']
absent_avg = df[df["Absenteeism"] == 'avg']
absent_high = df[df["Absenteeism"] == 'high']

fig, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=(13, 9))

# Kernel density plots for absenteeism by body mass index
sns.kdeplot(data=absent_low, x="BodyMassIndex", color='#861B04', fill=True, ax=ax1)
sns.kdeplot(data=absent_avg, x="BodyMassIndex", color='#F86C4D', fill=True, ax=ax1)
sns.kdeplot(data=absent_high, x="BodyMassIndex", color='#E5F803', fill=True, ax=ax1)
ax1.set_title('Body Mass Index & Absenteeism')
ax1.legend(['Low', 'Avg', 'High'])

# Kernel density plots for absenteeism by age
sns.kdeplot(data=absent_low, x="Age", color='#4DB5F8', fill=True, ax=ax2)
sns.kdeplot(data=absent_avg, x="Age", color='#C4FC05', fill=True, ax=ax2)
sns.kdeplot(data=absent_high, x="Age", color='#F30BF0', fill=True, ax=ax2)
ax2.set_title('Age & Absenteeism')
ax2.legend(['Low', 'Avg', 'High'])

# Kernel density plots for absenteeism by transportation expense
sns.kdeplot(data=absent_low, x="TransportationExpense", color='#05FC93', fill=True, ax=ax3)
sns.kdeplot(data=absent_avg, x="TransportationExpense", color='#FC0593', fill=True, ax=ax3)
sns.kdeplot(data=absent_high, x="TransportationExpense", color='#FC8F05', fill=True, ax=ax3)
ax3.set_title('Transportation Expense & Absenteeism')
ax3.legend(['Low', 'Avg', 'High'])

plt.tight_layout()

# Assign the figure to the output view of the Python View node
knio.output_view = knio.view(fig)

Since the different categories of absenteeism in the chart overlap considerably, it is clear that the level of absenteeism is not related to transportation expense, age, or body mass index.

Assess Employee Absenteeism Using K-Means Clustering

Objective: To cluster the employee records for absenteeism using K-means cluster analysis

K-means clustering is an unsupervised algorithm that partitions a given dataset into a certain number of clusters in a simple, iterative way. We must determine the number of clusters before applying the K-means algorithm. The resulting clusters can differ depending on the initialization and the distance function, so an effective initialization and a well-chosen distance function lead to a better clustering.

K-means picks K points, known as centroids, one for each cluster. Each data point is assigned to its closest centroid, forming K clusters. We then recompute the centroid of each cluster from its current members, which gives new centroids. With the new centroids, we repeat the assignment step, finding the closest new centroid for each data point and forming K new clusters. This process is repeated until convergence, that is, until the centroids no longer change. We need to determine the distances between data points without having any label information about these points; clustering aims to minimize the distance between the points and their cluster representatives. However, even if K-means clustering is carried out many times, we are not sure we will get a globally optimal solution. Hence, we need to try different initialization points and take the best of the resulting local optima.
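To make the loop concrete, here is a minimal NumPy sketch of one run of Lloyd's algorithm, the classic K-means iteration. The function name and structure are illustrative only; in the workflow this step is handled by the k-means node.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=42):
    # Illustrative K-means loop; assumes no cluster becomes empty during the iterations
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids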

In K-means, each cluster has its own centroid. The sum of squared differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster; summing this quantity over all clusters gives the total within-cluster sum of squares for the cluster solution. As the number of clusters increases, this value keeps decreasing, but if we plot the result, we may see that the sum of squared distances decreases sharply up to some value of k and then much more slowly after that. This elbow determines the optimum number of clusters.
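In symbols, with clusters C_1, …, C_K and centroids μ_1, …, μ_K, the quantity we plot against K (scikit-learn reports it as inertia_) is the total within-cluster sum of squares:

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$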

Using the Elbow Method

Objective: To determine the optimal number of clusters using the elbow method.

Determining the optimum number of clusters is always recommended before applying the K-means algorithm. We will use the sklearn.cluster library in a Python Script node to iterate over candidate values of K. An empty list is created, and a for loop over range(1, 11) executes the K-means algorithm for K values from 1 to 10. The call KMeans(n_clusters=i, random_state=42) applies the K-means algorithm, and the command datalist.append(kmeans.inertia_) stores the resulting inertia in the list for each value of K. Subsequently, we will use the Line Plot (Plotly) node to plot the number of clusters against the corresponding value of inertia.

import knime.scripting.io as knio

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
import pandas as pd

# Load the input data
df = knio.input_tables[0].to_pandas()

# Prepare the data for clustering (assumes all input columns are numeric)
AbsenteeismData = df.values

# Initialize an empty list to store inertia values
datalist = []

# Loop through different numbers of clusters to calculate inertia
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(AbsenteeismData)
    datalist.append(kmeans.inertia_)

# Convert datalist to a DataFrame for output
output_df = pd.DataFrame({'Number of Clusters': range(1, 11), 'Inertia': datalist})

# Output the datalist as a table
knio.output_tables[0] = knio.Table.from_pandas(output_df)

From the line plot, we can observe that the optimum number of clusters is 3. So, we will start cluster analysis considering three clusters.

Applying K-means Clustering

Before applying the K-means clustering, we will use the Normalizer node to normalize the continuous columns using the Z-score normalization method. We will then use the k-means node to apply the K-means algorithm to predict the cluster for each observation in the dataset. We will use the Rule Engine node to assign the clusters to the corresponding categories (low, avg, and high) of absenteeism. Finally, we will use the Scorer node to compare the actual (Absenteeism) and predicted cluster (AbsenteeismPred) columns by their attribute value pairs and show the confusion matrix. Additionally, the node provides several accuracy statistics such as True-Positives, False-Positives, True-Negatives, False-Negatives, Recall, Precision, Sensitivity, Specificity, and F-measures, as well as the overall accuracy.
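Outside KNIME, the same pipeline can be approximated in a few lines of scikit-learn. This is a rough sketch, not the workflow itself: the binning step is repeated from earlier, and the cluster-to-category mapping below is hypothetical; in practice it must be chosen by inspecting the clusters, just as the Rule Engine rules are.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("EmployeeAbsenteeism.csv")
df["Absenteeism"] = pd.qcut(df["AbsenteeismTimeInHours"], q=3, labels=["low", "avg", "high"])

# Z-score normalization of the numeric columns, then K-means with three clusters
X = StandardScaler().fit_transform(df.select_dtypes("number"))
clusters = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Hypothetical cluster-to-category mapping; choose it by inspecting the clusters
cluster_to_category = {0: "low", 1: "avg", 2: "high"}
df["AbsenteeismPred"] = pd.Series(clusters, index=df.index).map(cluster_to_category)

# Confusion matrix and overall accuracy, analogous to the Scorer node's output
print(confusion_matrix(df["Absenteeism"], df["AbsenteeismPred"]))
print(accuracy_score(df["Absenteeism"], df["AbsenteeismPred"]))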

The confusion matrix result shows that 77 observations belonging to the average cluster are grouped similarly in the predicted cluster. Similarly, the low cluster had 118 correctly classified employees, and the high cluster had 129 correctly classified employees. The accuracy score is 43.784%. However, forming proper categories of absenteeism after closely examining the data will help improve the accuracy.

Assessing Employee Absenteeism Using Hierarchical Clustering

Objective: To cluster the employee records for absenteeism using hierarchical cluster analysis

Hierarchical clustering uses two approaches: top-down and bottom-up.

  • Top-down or divisive: The algorithm starts with all data points in one huge cluster. The most dissimilar data points are divided into subclusters until each cluster has exactly one data point.

  • Bottom-up or agglomerative: The algorithm starts with every data point as one single cluster and tries to combine the most similar ones into superclusters until it reaches one huge cluster containing all subclusters.

In the case of a small number of observations, bottom-up or agglomerative hierarchical clustering provides better results, since it builds clusters from n to 1 by merging them bottom-up, yielding every possible clustering between 1 and n clusters along the way.

A measure must be defined to determine the distance between clusters. There are three methods for comparing two clusters.

  • Single Linkage: Defines the distance between two clusters c1 and c2 as the minimal distance between any two points x and y, with x in c1 and y in c2.

  • Complete Linkage: Defines the distance between two clusters c1 and c2 as the maximal distance between any two points x and y, with x in c1 and y in c2.

  • Average Linkage: Defines the distance between two clusters c1 and c2 as the mean distance between all pairs of points x in c1 and y in c2.

A distance measure is necessary to measure the distance between two points. You can choose between the Manhattan and Euclidean distances, corresponding to the L1 and L2 norms.
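In symbols, for two points x and y in d dimensions, these two distances are:

$$d_{L_1}(x, y) = \sum_{i=1}^{d} \lvert x_i - y_i \rvert \qquad d_{L_2}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$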

The output of the Hierarchical Clustering node is the same data as the input, with one additional column containing the name of the cluster to which each data point is assigned. Since the hierarchical clustering algorithm produces a series of cluster results, we set the number of output clusters to three, the distance function to Euclidean, and the linkage type to complete in the node dialog. The result of the agglomerative clustering is depicted using a dendrogram, a diagram that shows the hierarchical relationship between objects. A dendrogram's primary use is to find the best way to allocate objects to clusters.
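As a rough equivalent outside KNIME, SciPy's hierarchical-clustering routines can reproduce this configuration. The file name and column selection below are assumptions that mirror the dataset used throughout this article.

import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Numeric feature matrix (hypothetical column selection mirroring the workflow)
X = pd.read_csv("EmployeeAbsenteeism.csv").select_dtypes("number").values

# Agglomerative clustering with Euclidean distance and complete linkage,
# matching the settings in the Hierarchical Clustering node dialog
Z = linkage(X, method="complete", metric="euclidean")

# Cut the tree into three clusters, as configured in the node
labels = fcluster(Z, t=3, criterion="maxclust")

# The dendrogram shows the hierarchical relationship between observations
dendrogram(Z)
plt.show()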

As before, we will use the Rule Engine node to assign the clusters to the corresponding categories (low, avg, and high) of absenteeism, and the Scorer node to compare the actual (Absenteeism) and predicted cluster (AbsenteeismPred) columns by their attribute value pairs and show the confusion matrix together with the accuracy statistics.

The confusion matrix result shows that 35 observations belonging to the average cluster are grouped similarly in the predicted cluster. Similarly, the low cluster had 207 correctly classified employees, and the high cluster had 35 correctly classified employees. The accuracy score is 37.432%. However, forming proper categories of absenteeism after closely examining the data will help improve the accuracy.

Summary

In conclusion, identifying absenteeism patterns through cluster analysis is a valuable approach for organizations to understand and address employee absenteeism. By utilizing unsupervised machine learning techniques such as k-means and hierarchical clustering, organizations can group employees based on similar characteristics and identify patterns that may contribute to absenteeism. Analyzing employee absenteeism data, including converting continuous columns to categorical ones and visualizing the results, provides insights into factors that may or may not influence absenteeism. While the accuracy of k-means and hierarchical clustering can vary, they offer a structured way to analyze absenteeism data, and the results can be improved by refining the absenteeism categories and examining the data more closely. Ultimately, these techniques help organizations make informed decisions to reduce absenteeism and improve productivity.
