Employee Data Insights: Visualization Review

Introduction

The HR department needs to frame strategies that benefit employees, and it cannot do so if the underlying data is hard to understand. An HR professional must therefore understand the data before making decisions or performing advanced analysis. Employee data exploration starts with organizing the data into a usable, valuable resource and extracting information that improves the functioning of the HR department. Data visualization helps the HR professional understand the dynamics of employee data across multiple dimensions. This chapter discusses visualization techniques in KNIME that help analyze and visualize data as required.

Employee Data

To illustrate the use of KNIME, we consider the IBMdata.csv dataset for HR analytics. This dataset can be downloaded from the “Data” folder on the KNIME Community Hub.

The Reviewing Employee Details – Data Visualization workflow can be downloaded from the KNIME Community Hub.

Once you have downloaded the CSV file from the Data folder, you are ready to begin.

Reading the CSV-based dataset

The first step in our exploratory analysis is to load the data. We do this with the CSV Reader node, which reads the file into a KNIME table that the rest of the analysis works on.

The KNIME table is created by loading the IBMdata CSV dataset. The resulting table shows that the employee dataset has 1470 observations and 35 columns. Most columns are integers, and some, such as Gender, JobRole, and MaritalStatus, are string columns.
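Outside KNIME, the same check of row count, column count, and column types can be sketched in plain Python. This is only an illustration: the three rows below are made up, and the `infer_type` helper is a hypothetical stand-in for the type detection that the CSV Reader node performs on the real file.

```python
import csv
import io

# Made-up three-row excerpt for illustration only;
# the real IBMdata.csv has 1470 rows and 35 columns.
SAMPLE_CSV = """Age,Department,Gender,HourlyRate
30,Sales,Female,55
45,Research & Development,Male,70
38,Human Resources,Male,64
"""

def infer_type(values):
    """Return 'integer' if every value parses as an int, else 'string'."""
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        return "string"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
columns = list(rows[0].keys())
column_types = {c: infer_type([r[c] for r in rows]) for c in columns}

print(len(rows), "rows,", len(columns), "columns")
print(column_types)
```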

Distribution of Employees by Department

Objective: To create a pie chart corresponding to the department

Pie charts visualize absolute and relative frequencies. A pie chart is a circle partitioned into segments, each representing one category and drawn in a different color. The size of each segment, determined by its angle, is proportional to the relative frequency of its category.

We will use the Pie Chart node to plot the number and percentage of employees in each department.

The interactive pie chart shows the employee distribution of the three different departments.
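The numbers behind a pie chart can be sketched as follows. The department labels here are made up for illustration and are not the real employee counts; the point is only how a count turns into a percentage and a slice angle.

```python
from collections import Counter

# Made-up department labels, one per employee (not the real dataset).
departments = [
    "Sales", "Sales", "Sales",
    "Research & Development", "Research & Development",
    "Research & Development", "Research & Development",
    "Human Resources",
]

counts = Counter(departments)   # absolute frequencies
total = sum(counts.values())

slices = {}
for name, count in counts.items():
    share = count / total       # relative frequency
    slices[name] = {
        "count": count,
        "percent": 100 * share,
        "angle_deg": 360 * share,   # slice angle is proportional to the share
    }

for name, s in slices.items():
    print(f"{name}: {s['count']} employees, {s['percent']:.1f}%, {s['angle_deg']:.1f} degrees")
```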

Filter / Subset observations for the HR Department

For our data visualization, we will consider only employees belonging to the HR department. We can use the Row Filter node to keep only the observations whose Department is “Human Resources”.

The filtered table shows that the employee dataset for the HR department has 63 observations and 35 columns.
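Conceptually, the row filter keeps the rows that match a condition and drops the rest. A minimal sketch in Python, using made-up rows and a hypothetical `row_filter` helper (this is not how the KNIME node is implemented):

```python
# Made-up rows for illustration only.
employees = [
    {"Department": "Sales", "Gender": "Female"},
    {"Department": "Human Resources", "Gender": "Male"},
    {"Department": "Research & Development", "Gender": "Male"},
    {"Department": "Human Resources", "Gender": "Female"},
]

def row_filter(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]

hr_rows = row_filter(employees, "Department", "Human Resources")
print(len(hr_rows), "of", len(employees), "rows kept")
```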

Data Visualization using Bar Chart

Objective: To create an interactive bar chart to display category-wise count for the number of years since the last promotion

The count function of the bar chart shows the number of observations for each distinct value of the specified column. To begin with, we will use the Number to String node to convert the YearsSinceLastPromotion column from integer to string. Then, we will use the Sorter node to order the table in ascending order of the YearsSinceLastPromotion column. Subsequently, we will use the Bar Chart node to create a chart showing the count (aggregation) of each YearsSinceLastPromotion value for HR department employees.

The figure shows 24 observations for 0 years, 17 for 1 year, eight for 2 years, three each for 3, 4, 5, and 7 years, and one each for 10 and 12 years. The category with the highest frequency is displayed at the top. Of the 63 HR employees, 41 (about 65%) have a value of 0 or 1 year, which suggests that the organization promotes most employees regularly.
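The counts reported for this chart can be checked with a few lines of Python. The sketch below also illustrates a general pitfall of sorting numbers after they have been converted to strings: a plain lexicographic sort places "10" before "2". Whether this affects the KNIME workflow depends on how the Sorter node is configured, so treat the sorting part as an illustration in Python only.

```python
# Counts as reported for the bar chart: years since last promotion -> employees.
counts = {0: 24, 1: 17, 2: 8, 3: 3, 4: 3, 5: 3, 7: 3, 10: 1, 12: 1}

total = sum(counts.values())
within_one_year = counts[0] + counts[1]
share = within_one_year / total
print(f"{within_one_year} of {total} employees ({share:.1%}) at 0 or 1 year")

# After conversion to string, a plain sort is lexicographic, not numeric.
labels = [str(year) for year in counts]
lexicographic = sorted(labels)
numeric = sorted(labels, key=int)
print("lexicographic:", lexicographic)
print("numeric:      ", numeric)
```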

Data Visualization using Line Plot

Objective: To create a line chart corresponding to the hourly rate

We will use the Line Plot node to create an interactive line chart based on the HourlyRate of employees in the HR department.

The line chart shows that employees' hourly rates range from 30 to 100. Note, however, that a line chart is most effective for time series data. Here, it is drawn against the row number, which carries no particular meaning. The intention is only to familiarize the user with the concept of a line chart.

Data Visualization using Multiple Bar Chart

Objective: To create an interactive bar chart for different continuous columns according to gender

We will use the Bar Chart node to create multiple charts showing average (aggregation) values for Age, HourlyRate, TotalWorkingYears, and YearsAtCompany, with Gender as the category dimension column for HR department employees.

The output shows four bar charts, one for each column, each with one bar per gender. The charts show that male and female employees have nearly the same average total working years and age. The average hourly rate of male employees is higher than that of female employees. Male employees have, on average, spent more years with the company than female employees.
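The average aggregation behind these charts amounts to grouping rows by Gender and taking the mean of each column. A minimal sketch with made-up rows and a hypothetical `average_by` helper:

```python
from collections import defaultdict
from statistics import mean

# Made-up HR rows for illustration only.
rows = [
    {"Gender": "Female", "Age": 30, "HourlyRate": 50},
    {"Gender": "Female", "Age": 40, "HourlyRate": 70},
    {"Gender": "Male", "Age": 35, "HourlyRate": 60},
    {"Gender": "Male", "Age": 45, "HourlyRate": 80},
]

def average_by(rows, category, value_columns):
    """Group rows by `category` and average each of `value_columns`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row)
    return {
        key: {col: mean([r[col] for r in members]) for col in value_columns}
        for key, members in groups.items()
    }

averages = average_by(rows, "Gender", ["Age", "HourlyRate"])
for gender, values in averages.items():
    print(gender, values)
```

Each inner dictionary corresponds to one bar per chart: one value per column for each gender.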

Data Visualization using Correlation Matrix

Objective: To create a correlation matrix plot for multiple columns

We will use the Linear Correlation node to generate a correlation matrix heatmap based on a selected set of columns for HR department employees.

The square table shows the pair-wise correlation values of all selected columns. The color range varies from dark red (strong negative correlation) to dark blue (strong positive correlation). If a correlation value for a pair of columns is unavailable, the corresponding cell contains a missing value (shown as a cross in the color view). Hovering the mouse over a cell in the view window displays the corresponding pair-wise correlation value.
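The values in such a matrix are Pearson correlation coefficients. The sketch below computes them for made-up columns; the "Constant" column is a hypothetical addition that shows one way a correlation can be unavailable (a column with zero variance), in which case the function returns None.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
columns = {
    "Age": [25, 30, 35, 40, 45],
    "TotalWorkingYears": [2, 6, 10, 14, 18],
    "DistanceFromHome": [9, 7, 5, 3, 1],
    "Constant": [1, 1, 1, 1, 1],
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [matrix[a][b] for b in names])
```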

Data Visualization using Box Plot

Objective: To create an interactive box plot for daily rate by education field

We will use the Box Plot node to create an interactive plot based on the DailyRate column, with EducationField as the condition column for HR department employees.

A box plot describes the distribution of a continuous variable by plotting a summary of the following statistics. The box itself extends from the lower quartile (Q1) to the upper quartile (Q3). The median is drawn as a horizontal bar inside the box. The distance between Q1 and Q3 is called the interquartile range (IQR). Above and below the box are the so-called whiskers. They are drawn as horizontal bars at the minimum and maximum values and are connected to the box by a line.

The figure shows the general structure of an interactive box plot. The box plot is drawn for the daily rate by education field. There are five categories in EducationField: Medical, Human Resources, Life Sciences, Technical Degree, and Other. Hence, the figure shows five boxes, one for each category. For each category, we can read off the minimum and maximum values along with the first and third quartiles. The chart shows that the median daily rate is highest for “Other” and lowest for “Technical Degree”. The interquartile range of the daily rate is largest for “Life Sciences”. The lowest daily rate occurs in “Human Resources”.
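The statistics that a box plot summarizes can be computed directly. The sketch below uses made-up values and Python's `statistics.quantiles` with the "inclusive" method; the Box Plot node may use a different quartile convention, so the exact numbers it reports could differ slightly.

```python
from statistics import quantiles

def box_stats(values):
    """Minimum, quartiles, maximum, and IQR of a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,
    }

# Made-up values for two hypothetical categories (not the real daily rates).
values_by_category = {
    "Field A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Field B": [10, 20, 30, 40, 50],
}

stats = {name: box_stats(values) for name, values in values_by_category.items()}
for name, s in stats.items():
    print(name, s)
```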

Data Visualization using Violin Plot

Objective: To create an interactive violin plot for gender for years in the current role

We will first use the Color Manager node to assign specific colors for the categories (male and female) in the Gender column. Subsequently, we will use the Violin Plot (Plotly) node to create an interactive plot based on the YearsInCurrentRole column grouped by Gender (male and female) for employees in the HR department.

The violin plot combines a box plot and a kernel density plot. The width of the violin at a given value reflects how frequently that value occurs. Like a box plot, an interactive violin plot can produce multiple violins corresponding to different values of a categorical column. We can observe from the chart's shape that the most common time in the current role is 2 years among female employees and 3 years among male employees.

Data Visualization using Density Plot

Objective: To create an interactive density plot for years at the company by gender

We will use the Color Manager node configured earlier to assign specific colors for the categories (male and female) in the Gender column. Subsequently, we will use the Density Plot node to create an interactive plot with YearsAtCompany as the dimension column and Gender (male and female) as the condition column for employees in the HR department.

A kernel density estimate (KDE) plot visualizes the distribution of observations in a dataset, analogous to a histogram. KDE represents the data using a continuous probability density curve in one or more dimensions. The plot shows that male employees in the HR department have spent more years at the company.
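To make the idea of a density curve concrete, here is a minimal Gaussian KDE in plain Python. The sample values and the bandwidth of 1.5 are made-up assumptions for illustration; the Density Plot node chooses its own kernel and bandwidth settings.

```python
from math import exp, pi, sqrt

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps centered on each sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * sqrt(2 * pi))

    def density(x):
        return norm * sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Made-up YearsAtCompany values for illustration only.
years_at_company = [1, 2, 2, 3, 5, 8, 10]
density = gaussian_kde(years_at_company, bandwidth=1.5)

# A probability density integrates to 1; check with a simple Riemann sum from -10 to 30.
step = 0.05
grid = [-10 + i * step for i in range(801)]
area = sum(density(x) for x in grid) * step

print(f"density at 2 years: {density(2):.4f}")
print(f"area under curve:   {area:.4f}")
```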

Data Visualization using Scatter Plot

Objective: To create an interactive scatter plot for age and total working years by gender

We will use the Color Manager node configured earlier to assign specific colors for the categories (male and female) in the Gender column. Subsequently, we will use the Scatter Plot node to create an interactive plot based on Age and TotalWorkingYears with Gender (male and female) as the color dimension for employees in the HR department.

The figure shows a scatter plot of age against total working years. Most points lie on or near a straight line, which indicates a roughly linear relationship. The chart shows that total working years increase with age.
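The line that the points cluster around can be estimated with an ordinary least-squares fit. This sketch uses made-up values and a hypothetical `fit_line` helper; a positive slope corresponds to working years increasing with age.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values for illustration only.
age = [25, 30, 35, 40, 45]
total_working_years = [3, 7, 11, 18, 21]

slope, intercept = fit_line(age, total_working_years)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```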

Data Visualization using Scatter Plot Matrix

Objective: To create a scatter plot matrix for age, hourly rate, daily rate, and distance from home.

We will use the Color Manager node configured earlier to assign specific colors for the categories (male and female) in the Gender column. Subsequently, we will use the Scatter Plot Matrix node to create a pair-wise scatter plot matrix of Age, DailyRate, HourlyRate, and DistanceFromHome columns with Gender (male and female) as the color dimension for employees in the HR department.

The figure shows a pair plot covering all possible combinations of the four columns. Since there are four variables (n = 4), a matrix of 3x3 plots, that is (n-1) x (n-1), is created. A scatter plot is drawn for every combination of two variables.
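The number of distinct pairs can be enumerated with `itertools.combinations`. For four columns there are n(n-1)/2 = 6 unordered pairs, which is also the number of cells in the lower triangle of a 3x3 grid (3 + 2 + 1). The sketch makes no assumption about how the node arranges the panels.

```python
from itertools import combinations

columns = ["Age", "DailyRate", "HourlyRate", "DistanceFromHome"]

# Every unordered pair of distinct columns gets one scatter plot.
pairs = list(combinations(columns, 2))

n = len(columns)
print(f"{n} columns -> {len(pairs)} pairs (n*(n-1)/2 = {n * (n - 1) // 2})")
for x_column, y_column in pairs:
    print(f"  {x_column} vs {y_column}")
```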

Data Visualization using Matplotlib

Objective: To create a shape chart corresponding to monthly income.

For this purpose, we will create a table with the Table Creator node containing four series of monthly income data. We will use the Matplotlib visualization library in the Python View node to create a custom chart. In the figure, we plot multiple series with different shapes and colors. For every x, y pair of arguments, there is an optional third argument: a format string that concatenates a color code with a marker or line-style code. Thus, the first series, Series1, is drawn using green triangles (g^); the second, Series2, using blue squares (bs); the third, Series3, as a red dashed line (r--); and the fourth, Series4, using magenta circles (mo).

import knime.scripting.io as knio

# Code to display multiple lines with different shapes and colors
import matplotlib.pyplot as plt
from io import BytesIO

# Read the first input table into a pandas DataFrame
df = knio.input_tables[0].to_pandas()

# Create the figure and its overall title
fig = plt.figure(figsize=(15, 8))
plt.suptitle('Multiple Lines with Different Shapes and Colors',
             fontsize=20, fontweight='bold')

# Plot each series with its own format string (color + marker or line style)
plt.plot(df['List'], df['Series1'], 'g^',
         df['List'], df['Series2'], 'bs',
         df['List'], df['Series3'], 'r--',
         df['List'], df['Series4'], 'mo')

plt.xlabel('List')
plt.ylabel('Values')
plt.title('Multiple Series Plot')

# Set the axis limits and draw a black grid before the figure is rendered
plt.xlim(0, 10)
plt.ylim(0, 220)
plt.grid(True, color='k')

# Render the figure as SVG into an in-memory buffer
buffer = BytesIO()
fig.savefig(buffer, format='svg')

# Raw SVG bytes of the rendered figure
output_image = buffer.getvalue()

# Assign the figure to the output view
knio.output_view = knio.view(fig)  # alternative: knio.view_matplotlib()
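To show how a format string such as g^ or r-- breaks down into its parts, here is a deliberately simplified parser. It is not Matplotlib's own parser: the `parse_fmt` function and its lookup tables are hypothetical and cover only a small subset of the codes Matplotlib accepts.

```python
# Simplified subset of Matplotlib's single-letter codes (illustration only).
COLORS = {"b": "blue", "g": "green", "r": "red", "m": "magenta", "k": "black"}
MARKERS = {"^": "triangle", "s": "square", "o": "circle"}
LINE_STYLES = ["--", "-.", "-", ":"]  # longest codes first so "--" wins over "-"

def parse_fmt(fmt):
    """Split a format string into color, marker, and line-style parts."""
    result = {"color": None, "marker": None, "line": None}
    rest = fmt
    while rest:
        for style in LINE_STYLES:
            if rest.startswith(style):
                result["line"] = style
                rest = rest[len(style):]
                break
        else:
            ch, rest = rest[0], rest[1:]
            if ch in COLORS:
                result["color"] = COLORS[ch]
            elif ch in MARKERS:
                result["marker"] = MARKERS[ch]
            else:
                raise ValueError(f"unsupported format character: {ch!r}")
    return result

for fmt in ["g^", "bs", "r--", "mo"]:
    print(fmt, "->", parse_fmt(fmt))
```

A string with a marker but no line style, such as g^, draws markers only, while r-- draws a dashed line without markers.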

Alternatively, we can plot the four series of monthly income using the Line Plot node, with each series in a different color line.

Data Visualization using Stacked Area Chart

Objective: To create an area chart for expenses of different months in four years.

For this purpose, we will create a table with the Table Creator node containing four years of monthly expense data. Subsequently, we will use the Stacked Area Chart node to create a stacked area chart of the monthly expense data, with each year stacked on top of the previous one. The difference between a stacked area chart and a plain area chart is that a stacked area chart draws the values cumulatively, so each series starts where the previous one ends, whereas a plain area chart draws each series from the baseline as individual values. These charts can track changes over time for two or more related groups that together make up one whole category.

The chart shows the expenses for each of the 12 months in each year. We can observe that the expenses for 2013 are the highest, since its band occupies a larger area than that of any other year.
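The stacking itself is a running total across the series. The sketch below uses made-up values for three months and three hypothetical years; it computes the upper boundary of each band and the total per year, which roughly determines how much area each band occupies.

```python
# Made-up monthly expenses for illustration only (three months, three years).
expenses = {
    "Year 1": [10, 12, 11],
    "Year 2": [20, 18, 22],
    "Year 3": [5, 6, 7],
}

# Upper boundary of each band: running total over the years, month by month.
stacked = {}
running = [0, 0, 0]
for year, values in expenses.items():
    running = [r + v for r, v in zip(running, values)]
    stacked[year] = running

# Total per year: the year with the largest total has the widest band overall.
totals = {year: sum(values) for year, values in expenses.items()}
largest = max(totals, key=totals.get)

for year in expenses:
    print(year, "values:", expenses[year], "stacked top:", stacked[year])
print("largest total:", largest, totals[largest])
```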

Summary

In this example, we explored using KNIME for hands-on data visualization. Working with the IBMdata.csv dataset and various KNIME charting and plotting nodes enabled us to interpret data for better understanding and extracting crucial insights.
