Exploring & Manipulating Employee Data

Introduction

Organizations have a vast amount of employee data. The HR personnel in these organizations must understand these data and use basic statistical techniques to formulate effective strategies. They can use statistical methods such as mean, median, correlation, etc.; mathematical techniques such as minimum, maximum, etc.; and string operations such as concatenating data, modifying/updating records, etc. These basic techniques help the HR professional to have a better understanding of the data. This chapter discusses different functions in KNIME that help to analyze and manipulate data according to the requirement.

Employee Data

To understand the usage of KNIME, the IBMdata.csv for HR Analytics is considered. This dataset can be downloaded from the “Data” folder from the KNIME Community Hub.

You are ready after downloading the CSV file from the Data folder.

You can download the 02 Analyzing Employee Details - Data Exploration & Manipulation workflow from the KNIME Community Hub

Reading the CSV-based dataset

The first step in our analysis will be to load the data for our exploratory analyses. We will do this first step using the CSV Reader node before we persist our analysis in a KNIME table.

The KNIME table is created by loading the IBMdata CSV dataset.

The above table shows that the employee dataset has 1470 observations and 35 columns. Most columns are integers, and some, such as Gender, JobRole, MaritalStatus, etc., are string columns.

Data Exploration and Descriptive Summary Statistics

The Statistics node primarily determines the descriptive summary statistics of the columns in the dataset. This node calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values, and row count across all numeric columns. It counts all nominal values together with their occurrences. The node provides the following three output tables

Statistics Table: All statistic moments for all numeric columns,
Nominal Histogram Table: Nominal values for all selected categorical columns and
Occurrences Table: The most frequent/infrequent values from the categorical columns (Top/bottom)

The Data Explorer node offers various options for displaying the properties of the input data in an interactive view.

Data Manipulation of String Columns

Joining of Strings

Objective: To display the employee details using concatenation of Department and EducationField string columns

A new CombinedString column can be created by concatenation (combining) of different strings. The strings can be combined by using the Join function of the String Manipulation node.

Case Conversion of Strings

Objective: To display the employee details in capitalized, uppercase, and lowercase format.

In the String Manipulation node, the capitalize (str) function capitalizes all whitespace-separated words in a string so that each word comprises a title-case character and then a series of lowercase characters.

We apply the capitalize function on the CombinedString column to create the CapCombinedString column.

The uppercase(str) and lowercase(str) functions convert all the characters in a string into uppercase and lowercase, respectively.

Search and Replace of String

Objective: To search for a particular string and replace it with another

The replace(str, search, replace) function of the String Manipulation node replaces all occurrences of a String within another String. In our CombinedString column, we search for the word “Employee” and replace it with “Staff” in the appended RepCombinedString column.

Search and Match of String

Objective: To determine and indicate whether the employee belongs to the Sales department

We use the LIKE function of the Rule Engine node to search for “Sales” in the CombinedString column and create a new column, Sales, with “Yes” or “No,” to indicate whether the employee belongs to the Sales department.

The Like function helps locate the string “Sales” in the given string column. If the department has a “Sales” word, the corresponding Sales column is populated with a “Yes” value and a “No” value otherwise. The above results show that the first employee belongs to the sales department, and the subsequent four do not.

Aggregated Statistics using Group by Operation

The GroupBy node helps group observations based on a categorical column. After grouping observations, we can apply multiple aggregation functions, such as mean, median, minimum, maximum, count, sum, covariance, variance, standard deviation, etc.

We will use the GroupBy node to get the aggregated statistics for multiple columns across each Gender (male and female).

Mean of DistanceFromHome
Median of DistanceFromHome
Minimum Age
Maximum Age
Count of EmployeeCount
Sum of DailyRate
Covariance of HourlyRate
Variance of HourlyRate
Standard deviation of HourlyRate

Group by with Sorting

Objective: To determine the employee count and percent for different marital status

We will use the GroupBy node to get the distinct employee count and percentage across each marital status (group by MaritalStatus, aggregate count, and percent by EmployeeNumber). Subsequently, we use the Sorter node to order the resultant table in descending order of count.

We can observe three distinct values of Marital Status: Single, Married, and Divorced. The number of observations of married employees is maximum (673, 45.782%) and displayed at the top; followed by the number of observations of single employees (470, 31.973%), and observations having divorced employees is minimum (327, 22.245%) and displayed at the bottom.

Data Filtering (subset) using AND and OR operation

Objective: To filter the data of female employees belonging to the sales department

We will use the Row Filter node and add the criterion1 filter column as Gender (value = “Female”) and the criterion2 filter column as Department (value = “Sales”). By selecting the “All criteria” option, the “AND” filter condition is applied (189 rows in the resultant filtered table). On the other hand, by choosing the “Any criteria” option, the “OR” filter condition is applied (845 rows in the resultant filtered table).

Determine Correlation Using Spearman Correlation Coefficient

Objective: To determine the correlation between hourly rate and job satisfaction

We will use the Rank Correlation node to determine the Spearman correlation coefficient of the HourlyRate and JobSatisfaction columns. The node calculates a correlation coefficient for each pair of selected columns, i.e., a measure of the correlation of the two variables. All measures are based on the rank of the cells. By executing the node, we get three outputs for statistical inference. Correlation measure contains correlation variables, p-values, and degrees of freedom. The correlation matrix displays the correlation variables in a matrix representation. The Rank table contains the fractional ranks of the columns where the rank corresponds to the position of the value in a sorted table.

The correlation between the two groups lies between -1 and +1. The value -1 denotes a perfect negative correlation, and +1 denotes a perfect positive correlation. The value 0 denotes no correlation. The value between 0 and 1 denoted the extent of correlation. A lesser value denotes a lower correlation, and a higher value denotes a higher correlation. From the output, we can see that the Spearman correlation is -0.06; this means a low correlation between the two groups. We know the null hypothesis is rejected if the p-value is <0.05. Our data shows that the p-value is less than 0.05, which means the null hypothesis is rejected. This means there exists a negative correlation between hourly rate and job satisfaction.

Determine Correlation Using Pearson Correlation Coefficient

Objective: To determine the correlation between monthly rate and job satisfaction

We will use the Linear Correlation node to determine the Spearman correlation coefficient of the MonthlyRate and JobSatisfaction columns. The node calculates a correlation coefficient for each pair of selected columns, i.e., a measure of the correlation of the two variables. By executing the node, we get two outputs for statistical inference. The correlation measure contains correlation variables, p-values, and degrees of freedom. The correlation matrix displays the correlation variables in a matrix representation.

The correlation coefficient is 0.001, which is very low. Also, the p-value (0.98) is more than 0.05, which means the null hypothesis failed to be rejected. This means no significant correlation exists between monthly rate and job satisfaction.

Summary

In conclusion, this article demonstrated the effective use of KNIME for data exploration and manipulation, specifically in HR analytics using the IBMdata.csv dataset. By leveraging various KNIME nodes, we performed a comprehensive exploratory data analysis, including reading and summarizing data, manipulating string columns, and conducting statistical analyses such as correlation. These techniques provide HR professionals with valuable insights into employee data, enabling them to make informed decisions and develop strategic initiatives. The hands-on approach with KNIME showcases its versatility and power in handling complex data tasks, making it an essential tool for data-driven decision-making in organizations.

Analyzing Employee Details – Data Exploration & Manipulation

Table of contents