Identify Core Factors in Performance Appraisal - Principal Component Analysis

Vijaykrishna

Introduction

Performance appraisal is a regular review of an employee’s job performance and overall contribution to a company. It evaluates an employee’s attitude, knowledge, skills, commitment, achievements, etc. Organizations use performance appraisals to justify pay increases, bonuses, and termination decisions. They can be conducted at any time but are typically annual, semi-annual, or quarterly. Organizations need to provide employees with big-picture feedback on their work, and doing so requires considering the many dimensions that contribute to measuring performance. However, since the number of dimensions is large, grouping similar dimensions becomes necessary. This article focuses on determining the important factors that emerge when similar dimensions are grouped, using two essential techniques: exploratory factor analysis and principal component analysis. The methods discussed can also be used to determine factors related to other aspects of the HR function.

Unsupervised Machine Learning

Unsupervised machine learning algorithms are used when the output is unknown and no predefined labels or instructions are available to the learning algorithm. In unsupervised learning, the algorithm only has input data, and knowledge must be extracted from those data. These algorithms create a new representation of the data that is easier to comprehend than the original and can improve the accuracy of downstream algorithms while reducing computation time and memory. Standard unsupervised machine learning algorithms include association rule mining, dimensionality reduction, and clustering.

Dimension Reduction Techniques

Dimension reduction algorithms take a high-dimensional representation of the data, consisting of many features, as input and produce an output that summarizes the data by grouping its essential characteristics into fewer factors. The two standard dimensionality reduction techniques are principal component analysis and factor analysis. Principal component analysis replaces a large number of correlated variables with a smaller number of uncorrelated variables. It helps us understand the data's structure, shape, and covariance, which is difficult to do with simple scatter plots alone. It is a method that rotates the dataset so that the rotated features are statistically uncorrelated, and it is used to summarize the data and reduce their dimensionality. In contrast, exploratory factor analysis is a helpful hypothesis-generating tool for understanding the relationships among large numbers of variables.

When we deal with vast amounts of data, we are often unsure how useful the collected information is, and deriving helpful information becomes tedious. However, we cannot simply drop variables on the assumption that they are not beneficial. When the number of variables is large, applying statistical tests, creating scatter plots, finding correlations between variables, and interpreting the data is not easy: there are too many pair-wise correlations to consider, and the data are hard to comprehend through graphical displays alone. Hence, it is necessary to group these variables and reduce them to a few interpretable linear combinations of the data so that the data can be interpreted in a more meaningful form. Each linear combination represents a principal component or a factor. Thus, dimensionality reduction is helpful when we have many variables in our dataset and need to reduce their number, or when responses to many questions tend to be highly correlated. This technique is generally applied before performing a t-test, ANOVA, regression, or cluster analysis on a dataset with correlated variables.
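As a quick back-of-the-envelope illustration of why this matters for our 29-item appraisal questionnaire, the number of pair-wise correlations grows quadratically with the number of variables:

# With 29 survey items, the number of pair-wise correlations to inspect is already unwieldy
n_variables = 29
n_pairs = n_variables * (n_variables - 1) // 2
print(n_pairs)  # 406 pair-wise correlations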

Performance Appraisal Data

We will use the PerformanceAppraisal.csv file to demonstrate the analysis in KNIME. This dataset can be downloaded from the “Data” folder on the KNIME Community Hub.

The Identify Core Factors in Performance Appraisal – Principal Component Analysis workflow can be downloaded from the KNIME Community Hub.

After downloading the CSV file from the Data folder, you are ready to follow along.

Reading the CSV-based dataset

The first step in our analysis is to load the data for exploratory analysis. We do this with the CSV Reader node, which reads the file into a KNIME table.

The KNIME table is created by loading the PerformanceAppraisal.csv dataset. The table shows that the employee dataset has 322 observations and 29 columns, and each column has values between one and five based on the Likert scale.
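For readers who want to inspect the file outside KNIME, a minimal pandas sketch of the same loading step looks like this (assuming PerformanceAppraisal.csv sits in the working directory; the path is illustrative):

import pandas as pd

# Load the appraisal responses and confirm the expected shape (322 rows, 29 columns)
df = pd.read_csv("PerformanceAppraisal.csv")
print(df.shape)
print(df.head())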

Data Exploration

The Statistics node primarily determines the descriptive summary statistics of the columns in the dataset. This node calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values, and row count across all numeric columns. It counts all nominal values together with their occurrences. The node provides the following three output tables.

  1. Statistics Table: All statistic moments for all numeric columns,

  2. Nominal Histogram Table: Nominal values for all selected categorical columns and

  3. Occurrences Table: The most frequent/infrequent values from the categorical columns (Top/bottom)

The above table shows that each column has values between one and five based on the Likert scale.
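Continuing the pandas sketch above, a rough plain-Python equivalent of the Statistics node’s numeric summary is a few lines with describe(), plus a separate missing-value count:

# Descriptive statistics per column: count, mean, std, min, quartiles, max
summary = df.describe().T
summary["missing"] = df.isna().sum()   # number of missing values per column
print(summary)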

Data Visualization

We will use the Linear Correlation node to generate a correlation matrix heatmap based on a selected set of columns.

The square correlation matrix shows the pair-wise correlation values of all columns. The color range varies from dark red (strong negative correlation) to dark blue (strong positive correlation). If a correlation value for a pair of columns is unavailable, the corresponding cell contains a missing value (shown as a cross in the color view). Hovering the mouse over each box in the views window displays the corresponding pair-wise correlation value. It is clear from the chart that many groups of variables are correlated with one another.
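If you prefer to build the heatmap yourself, a comparable chart can be produced in a Python View node with pandas and matplotlib. The sketch below assumes the appraisal table is connected to the node’s first input and uses a diverging colormap similar to the node’s red-to-blue range:

import knime.scripting.io as knio
import matplotlib.pyplot as plt

# Pair-wise Pearson correlations of the numeric appraisal columns
data = knio.input_tables[0].to_pandas().select_dtypes('number')
corr = data.corr()

fig, ax = plt.subplots(figsize=(12, 10))
cax = ax.matshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(cax)

ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)

# Assign the figure to the node's output view
knio.output_view = knio.view(fig)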

Determine Factors of Performance Appraisal System Using Exploratory Factor Analysis

Objective: To determine the factors of a performance appraisal system using exploratory factor analysis

Factor analysis is an exploratory data analysis method used to search for critical underlying factors or latent variables from a set of observed variables. It helps in data interpretation by reducing the number of variables. It is widely utilized in nearly all specializations where we need to reduce the number of existing features, like market research, advertising, finance, etc. Factor analysis is a linear statistical model. It explains the variance among the observed variables and reduces a set of observed variables to unobserved variables called factors.

Factors are latent, hidden, unobserved, or hypothetical variables. A factor describes the association among a set of observed variables. The maximum number of factors equals the number of observed variables. Every factor explains a certain amount of variance in the observed variables through common patterns of responses. Factor analysis is therefore a method for investigating whether the variables N1, N2, …, Nn are linearly related to a smaller number of unobserved factors F1, F2, …, Fk.
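In equation form, the common factor model expresses each observed variable as a weighted sum of the factors plus a unique error term, where the weights w are the factor loadings:

N_i = w_i1·F_1 + w_i2·F_2 + … + w_ik·F_k + e_i, for i = 1, …, n, with k factors (k < n).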

Some common assumptions that need to be fulfilled before applying factor analysis include the following: the data should not contain outliers, the sample size should be greater than the number of factors, and there should not be homoscedasticity among the variables.

There are two types of factor analysis:

  1. Exploratory Factor Analysis: EFA is a commonly used approach for reducing the number of features. The basic assumption is that any observed variable is directly associated with any factor. EFA is a statistical technique for identifying latent relationships among sets of observed variables in a dataset. In particular, EFA seeks to model a large set of observed variables as linear combinations of some smaller unobserved latent factors.

  2. Confirmatory Factor Analysis: In CFA, the basic assumption is that each factor is associated with a particular set of observed variables.

For our analysis, we will apply EFA using the “factor_analyzer” Python library in the Python Script node. Before applying factor analysis, it is critical to evaluate whether determining the factors in the dataset is possible. There are two methods to check the factorability or sampling adequacy:

  1. Bartlett’s Test: Bartlett’s test of sphericity checks whether the observed variables intercorrelate at all by testing the observed correlation matrix against the identity matrix. If the p-value is <0.05, the test is considered statistically significant, indicating that the observed correlation matrix is not an identity matrix. You should not use factor analysis if the test is statistically insignificant.

  2. Kaiser-Meyer-Olkin (KMO) Test: The KMO test measures the suitability of data for factor analysis. It determines the adequacy for each variable and for the complete model. The KMO statistic estimates the proportion of variance among all the observed variables that might be common variance (i.e., caused by underlying factors); higher values indicate that the data are more suitable for factor analysis. The range of KMO values is between 0 and 1, and a KMO value of less than 0.5 is considered inadequate for performing factor analysis.

import knime.scripting.io as knio
import pandas as pd

# Tests of factorability / sampling adequacy
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Load the input data
df = knio.input_tables[0].to_pandas()

# Bartlett's test of sphericity: tests the observed correlation matrix against an identity matrix
chi_square, p_value = calculate_bartlett_sphericity(df)
print("Chi-square value of Bartlett's test:", chi_square.round(3))
print("p-value of Bartlett's test:", p_value.round(3))

# KMO test: measures sampling adequacy per variable and for the complete model
kmo_values, kmo_model = calculate_kmo(df)
print("KMO model:", kmo_model.round(3))
print("KMO values:\n", kmo_values.round(3))

# Output the per-variable KMO values as a table
output_df = pd.DataFrame(kmo_values, index=df.columns, columns=["KMO"])
knio.output_tables[0] = knio.Table.from_pandas(output_df)

The p-value of Bartlett’s test is 0, meaning it is statistically significant. The overall KMO value for our data is 0.786, which indicates good sampling adequacy. Both tests indicate that factor analysis can be executed since the condition of adequacy is met.

Exploratory Factor Analysis

The primary objective of factor analysis is to reduce the number of observed variables and find unobservable variables. The factors that explain the least variance should be dropped. Rotation, which can be orthogonal or oblique, is a tool for better interpretation of factor analysis; it redistributes the communalities to produce a clearer pattern of loadings.

We will apply EFA using the “factor_analyzer” Python library in the Python Script node. The eigenvalues are determined using the get_eigenvalues() function, which returns one eigenvalue per variable; these eigenvalues help us determine the optimum number of factors.

import knime.scripting.io as knio
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Load the input data
df = knio.input_tables[0].to_pandas()

# Fit an initial factor analysis model to obtain the eigenvalues
fa = FactorAnalyzer()
fa.fit(df)

# get_eigenvalues() returns the original eigenvalues of the correlation matrix and the
# common-factor eigenvalues; the original eigenvalues guide the choice of the number of factors
eigen_values, common_factor_eigen_values = fa.get_eigenvalues()
print("Eigen values are: \n", eigen_values.round(3))

# Convert eigenvalues to a DataFrame and output them as a table
EigenValues_df = pd.DataFrame(eigen_values, columns=["EigenValues"])
knio.output_tables[0] = knio.Table.from_pandas(EigenValues_df)

# Re-fit the model with six factors (the number suggested by the eigenvalue and scree-plot
# inspection discussed below) and a varimax (orthogonal) rotation for easier interpretation
fa = FactorAnalyzer(n_factors=6, rotation="varimax")
fa.fit(df)

# Get the rotated factor loadings
factor_loadings = fa.loadings_.round(4)

# Convert the factor loadings to a DataFrame (one row per variable, one column per factor)
FactorLoadings_df = pd.DataFrame(
    factor_loadings,
    index=df.columns,
    columns=[f"Factor {i+1}" for i in range(factor_loadings.shape[1])],
)

# Output the factor loadings as a table
knio.output_tables[1] = knio.Table.from_pandas(FactorLoadings_df)

# get_factor_variance() returns the sum of squared loadings (variance), the proportion
# of variance, and the cumulative variance for each factor
fa_variance = fa.get_factor_variance()

# Convert the factor variances to a DataFrame for output
FactorVariance_df = pd.DataFrame({
    "SS Loadings": fa_variance[0],
    "Proportion of Variance": fa_variance[1],
    "Cumulative Variance": fa_variance[2],
})

# Output the factor variances as a table
knio.output_tables[2] = knio.Table.from_pandas(FactorVariance_df)

One of the difficult decisions when conducting a factor analysis is determining the number of factors. Different metrics, such as eigenvalues, the total percentage of variance explained, factor loadings, and line plots, are used to decide the number of factors. The Kaiser criterion is an analytical approach that retains only the factors whose eigenvalues are greater than one. Eigenvalues represent the variance explained by each factor out of the total variance, and they are a good criterion for determining the optimum number of factors: generally, an eigenvalue greater than one is used as the selection criterion. The eigenvalues are obtained with the get_eigenvalues() function. A line plot (scree plot) is a graphical approach based on plotting the factors against their eigenvalues; it helps us determine the number of factors by locating the point where the curve makes an elbow.

One criterion is to retain factors whose eigenvalue is greater than or equal to one. This is because a factor with an eigenvalue of 1 accounts for as much variance as a single variable, and the logic is that only factors that explain at least as much variance as a single variable are worth keeping. However, this criterion sometimes yields too many factors. From the results, we can observe that the eigenvalues are greater than one for eight factors, and for two more, they are nearly 1.0. We will therefore also use a line plot, which shows the eigenvalues on the y-axis and the number of factors on the x-axis. It always displays a downward curve, and the point where the slope levels off (the elbow) indicates the number of factors the analysis should generate. Plotting the factor number against its eigenvalue, from 1 up to the maximum number of variables, suggests that the optimum number of factors for this dataset might be six.
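The line plot can be generated in a Python View node from the eigenvalue table produced above; the following is a minimal sketch, assuming that table is connected to the node’s first input:

import knime.scripting.io as knio
import matplotlib.pyplot as plt

# Eigenvalues from the factor analysis step (one value per candidate factor)
eigen_df = knio.input_tables[0].to_pandas()
eigenvalues = eigen_df.iloc[:, 0].to_numpy()

fig, ax = plt.subplots()
ax.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker='o')
ax.axhline(y=1, linestyle='--')   # Kaiser criterion reference line at eigenvalue = 1
ax.set_xlabel("Factor number")
ax.set_ylabel("Eigenvalue")
ax.set_title("Scree plot")

knio.output_view = knio.view(fig)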

It is clear from the results that both the line plot and the eigenvalue criterion yield a fairly high number of factors: a cutoff of an eigenvalue greater than or equal to one would result in eight factors, while the line plot suggests six factors based on where the slope levels off. It is important to remember that one of the reasons for running a factor analysis is to reduce the large number of variables that describe a complex concept, such as a performance appraisal system, to a few interpretable latent variables. In other words, we would like to find a small number of interpretable factors that explain the maximum variability in the data. Therefore, another essential metric to keep in mind is the total variability of the original variables explained by each factor solution. For example, if three factors explain most of the variability in the original 12 variables, those factors are a good, simpler substitute for all 12 variables. However, if it takes six factors to explain most of the variance in those 12 variables, the purpose of the factor analysis is largely defeated.

The matrix of weights, or factor loadings, generated from an EFA model describes the underlying relationships between each variable and the latent factors. Factor loadings are similar to standardized regression coefficients, and variables with higher loadings on a particular factor can be interpreted as explaining a larger proportion of the variation in that factor. Factor loading matrices are often rotated after the factor analysis model is estimated to produce a simpler, more interpretable structure and to identify which variables load on a particular factor. The factor loading matrix shows the relationship of each variable to the underlying factors: it contains the correlation coefficients between the observed variables and the factors. It is also essential to determine the variance explained by each factor. The factor loadings are accessed through the loadings_ attribute, and the variance explained by the factors is obtained with the get_factor_variance() function.

It should be noted that each identified factor should have at least three variables with high factor loadings and that each variable should load highly on only one factor. The preceding result displays the factor loadings and the variance of each factor. From the factor loadings matrix, we can observe the loading of each variable on each factor and determine the factor on which each variable loads most highly; the bold entries show the highest loading on the respective factor. From the matrix, we can observe that there are six factors, so the optimum number of factors for this dataset is six.
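To make the “highest loading per variable” step explicit, the grouping can also be read off programmatically from the FactorLoadings_df built in the script above; a small sketch appended to that script:

# For each variable, find the factor on which it loads most strongly (by absolute value)
dominant_factor = FactorLoadings_df.abs().idxmax(axis=1)

# Group variables by their dominant factor to see the emerging structure
for factor, items in dominant_factor.groupby(dominant_factor):
    print(factor, "->", list(items.index))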

Factor analysis has some limitations. Its interpretation can be debatable because multiple interpretations of the same data factors can be made. Besides, factor identification and naming require domain knowledge.

Determine Factors of Performance Appraisal System Using Principal Component Analysis

Objective: To determine the factors of a performance appraisal system using PCA

Most problems of interest to organizations are multivariate: they contain multiple dimensions that must be looked at simultaneously. Many statistical analysis techniques, including machine learning algorithms, are sensitive to the number of dimensions in a problem, and in the big-data era, high dimensionality can render a problem computationally intractable. Hence, dimensionality reduction aims to replace a larger set of correlated variables with a smaller set of derived variables while losing as little information as possible, and the best way to minimize information loss is to preserve variance. Principal Component Analysis (PCA) is a data reduction technique that transforms a set of correlated variables into a much smaller set of uncorrelated variables called principal components. Simply put, PCA is a method of extracting essential variables (components) from a large set of variables available in a dataset. It extracts a low-dimensional set of features from a high-dimensional dataset to capture as much information as possible. It is performed on a symmetric correlation or covariance matrix, which means the data must be numeric and should be standardized. The main idea of PCA is to reduce the dimensionality of a dataset consisting of many correlated variables while retaining the maximum information (variance) in the dataset; this transformation of the original variables into a new set of variables is what we call PCA.

PCA is used to overcome feature redundancy in a dataset. The extracted components are low-dimensional and aim to capture as much information as possible, with high explained variance. The first component has the highest variance, followed by the second, third, and so on, and the components must be uncorrelated. PCA is applied to datasets with numeric variables and is also a tool that helps produce better visualizations of high-dimensional data.

The first principal component is a linear combination of the original predictor variables that captures the maximum variance in the dataset. It determines the direction of the highest variability in the data: the larger the variability captured by the first component, the more information it captures, and no other component can have higher variability. The first principal component results in a line closest to the data; that is, it minimizes the sum of the squared distances between the data points and the line.

The second principal component is also a linear combination of the original predictors; it captures the remaining variance in the dataset and is uncorrelated with the first principal component. In other words, the correlation between the first and second components should be zero. All succeeding principal components follow the same idea: they capture the remaining variation without being correlated with the previous components. For an n × p data matrix, min(n − 1, p) principal components can be constructed. The directions of these components are identified in an unsupervised way because the response variable is not used to determine them; therefore, PCA is an unsupervised approach.

PCA is a linear orthogonal transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate, the second greatest variance on the second coordinate, and so on. The analysis uses an orthogonal projection of highly correlated variables onto a set of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables, and the objective is to place as much of the variance as possible on the first principal component, then the second, and so on.
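To make this orthogonal transformation concrete, the sketch below computes principal components directly from the covariance matrix of the mean-centred data with NumPy. It is an illustration of the idea rather than part of the KNIME workflow, and it assumes the appraisal table is connected to the node’s first input:

import knime.scripting.io as knio
import numpy as np

# Numeric appraisal columns (29 Likert-scale items)
df_numeric = knio.input_tables[0].to_pandas().select_dtypes('number')
X = df_numeric.to_numpy(dtype=float)
X_centred = X - X.mean(axis=0)          # mean-centre each column

# Eigendecomposition of the covariance matrix
cov = np.cov(X_centred, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse to descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the centred data onto the first six principal components
scores = X_centred @ eigenvectors[:, :6]

# Proportion of total variance captured by each component
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio[:6].round(3))

Scikit-learn’s PCA performs an equivalent computation internally (via SVD of the centred data), so the explained-variance ratios from this sketch should closely match those reported later.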

PCA components explain the maximum variance, while factor analysis explains the covariance in the data. PCA components are completely orthogonal to each other, whereas factor analysis does not require the factors to be orthogonal. A PCA component is a linear combination of the observed variables, while in factor analysis the observed variables are modeled as linear combinations of the unobserved factors. PCA components are often difficult to interpret, whereas in factor analysis the underlying factors can be labeled and interpreted.

PCA is a statistical procedure that transforms a dataset into a new dataset containing linearly uncorrelated variables, known as principal components. The basic idea is that the dataset is transformed into a set of components, each attempting to capture as much of the variance in the data as possible. In the Python Script node, the PCA() class from the “sklearn.decomposition” module performs PCA with a specified number of components.

import knime.scripting.io as knio
import pandas as pd
from sklearn.decomposition import PCA

# Load the input data
df = knio.input_tables[0].to_pandas()

# Ensure the input data contains only numeric values
df_numeric = df.select_dtypes(include=['number'])

# Perform principal component analysis with six components
pca = PCA(n_components=6)
pca.fit(df_numeric)
df_transformed = pca.transform(df_numeric)

# Convert the transformed data (component scores) to a DataFrame
df_transformed_df = pd.DataFrame(
    df_transformed,
    columns=[f"PC{i+1}" for i in range(df_transformed.shape[1])]
)

# Output the transformed data as a table
knio.output_tables[0] = knio.Table.from_pandas(df_transformed_df)

# Get the principal components (loadings of each variable on each component)
pca_components = pca.components_.round(3)

# Convert the principal components to a DataFrame for output (one row per variable)
pca_components_df = pd.DataFrame(
    pca_components.T,
    index=df_numeric.columns,
    columns=[f"Factor {i+1}" for i in range(pca_components.shape[0])]
)

# Output the component loadings as a table
knio.output_tables[1] = knio.Table.from_pandas(pca_components_df)

# Get the proportion of variance explained by each component
pca_variance = pca.explained_variance_ratio_.round(3)

# Convert the explained variance ratios to a DataFrame for output
pca_variance_df = pd.DataFrame(pca_variance, columns=["Explained Variance Ratio"])

# Output the explained variance ratios as a table
knio.output_tables[2] = knio.Table.from_pandas(pca_variance_df)

Applying principal component analysis, the 29 variables are reduced to six principal components using PCA(n_components=6), followed by fitting and transforming the numeric data.

Thus, the reduced dataset has 322 rows and six columns, and the principal components dataset has 29 rows and six columns. The details of all components can be found using pca.components_. The resulting table displays the relationship of variables with the factors.

The variance explained by each component is determined using explained_variance_ratio_.

Thus, we can observe that the variance explained by the first component is 20.4%. The second principal component explains less variance than the first (10%), the third component explains 9%, the fourth 6.3%, the fifth 5%, and the sixth 4.2%. Thus, the total variance explained by all six components is nearly 55%.
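The cumulative figure can be verified directly from the fitted PCA object with a one-line addition to the script above:

# Cumulative proportion of variance explained by the six retained components
print(pca.explained_variance_ratio_.cumsum().round(3))   # final value should be close to 0.55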

It should be noted that since principal component analysis and exploratory factor analysis both derive their solutions from the correlations among the observed variables, we need to decide which model best fits our research goals. We also need to determine how many components or factors to extract. Then, we extract the components or factors and interpret the results. Finally, we compute the component or factor scores. We use the Python View node to create a color map depicting the components.

import knime.scripting.io as knio

# This example script plots a heatmap using matplotlib and assigns it to the output view
import matplotlib.pyplot as plt

# Only use numeric columns
data = knio.input_tables[0].to_pandas().select_dtypes('number')

# Convert the data to a numpy array
values = data.to_numpy()

# Plot the heatmap
fig, ax = plt.subplots(figsize=(15, 12))  # Adjusted figure size for better visualization
cax = ax.matshow(values, cmap='viridis')  # Use a colormap for better visualization
plt.colorbar(cax)

# Set x-ticks and y-ticks
ax.set_xticks(range(data.shape[1]))
ax.set_xticklabels(data.columns, rotation=45, ha='left')  # Rotate x-tick labels for better readability
ax.set_yticks(range(data.shape[0]))
ax.set_yticklabels(data.index)

# Annotate each cell with the numeric value
for i in range(values.shape[0]):
    for j in range(values.shape[1]):
        ax.text(j, i, f"{values[i, j]:.3f}", ha='center', va='center', color='white', fontsize=5)

# Set labels
ax.set_ylabel("Features")
ax.set_xlabel("Principal Components")

# Assign the figure to the output_view variable
knio.output_view = knio.view(fig)

From the table and chart, the highest loading of each variable on a particular factor can be determined. The table makes the following grouping clear:

  1. Factor 1: Job-based Attitude: Efficiency, Commitment, Job Knowledge, Completeness, Quality, Quantity, Creativity, Appropriateness, Timeliness, and Flexibility contribute to the first factor. All these characteristics are related to job performance. Hence, this factor can be called a Job-based attitude.

  2. Factor 2: Behavioral Aspects: Attitude, Availability, and Communication. All these characteristics pertain to an employee's behavior. Hence, this factor can be called behavioral aspects.

  3. Factor 3: Leadership Quality: Responsibility, Focus, Decision Making, and Stress Tolerance. Since all these traits are traits of a leader, this factor can be called Leadership quality.

  4. Factor 4: Team-based Attitude: Interpersonal Relations, Share Ideas, Resourcefulness, Confidence, and Acceptability. These characteristics depict an employee's performance when working in a team. Hence, this factor can be called a team-based attitude.

  5. Factor 5: Planning: Receptive, Initiative, Strategic Approach, and Appropriateness. These qualities are required for planning, which is why this factor can be termed planning.

  6. Factor 6: Problem-solving Attitude: Solution-oriented, Logical Approach, Judgment, and Dependability. These characteristics emphasize a problem-solving attitude; hence, this factor can be termed a problem-solving attitude.

Thus, 29 variables considered in the study related to performance appraisal can be reduced to six factors: problem-solving attitude, leadership quality, planning, behavioral aspects, team-based attitude, and job-based attitude.

Summary

In conclusion, performance appraisal is crucial for evaluating an employee’s contributions and effectiveness in an organization. Organizations can identify the core factors that influence performance using techniques such as exploratory factor analysis and principal component analysis. These methods reduce the data’s complexity by grouping similar dimensions, making the results easier to interpret and act upon. The analysis of performance appraisal data can reveal key factors such as job-based attitude, behavioral aspects, leadership quality, team-based attitude, planning, and problem-solving attitude. Understanding these factors allows organizations to provide more targeted feedback, improve decision-making regarding employee development, and enhance overall organizational performance.

