Understanding One-Hot Encoding and Dummy Variables

One-hot encoding, also known as dummy variables, is a common way to represent categorical data. We need it because a machine learning model cannot work with categorical text values directly; it only understands numerical data.

One option would be to simply number the categories: if a feature has 26 possible values, we could assign them the integers 0 to 25. This is not recommended, because ML models read the numbers as quantities, so a larger number looks like a higher priority, which can lead to wrong output. Instead, we use a special encoding based on dummy variables.
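To make the ordering problem concrete, here is a minimal sketch in plain Python. The category names and the `code_of` mapping are made up for illustration; they are not from any real dataset.

```python
# Hypothetical toy data: the category names are illustrative only.
categories = ["blue", "green", "red"]
code_of = {name: i for i, name in enumerate(categories)}  # blue=0, green=1, red=2

samples = ["red", "green", "blue", "green"]
encoded = [code_of[s] for s in samples]
print(encoded)  # [2, 1, 0, 1]

# A model that treats these codes as numbers sees an order and distances
# that the categories never had.
print(code_of["red"] > code_of["green"])  # True
print(code_of["red"] - code_of["blue"])   # 2
```

Nothing about the colors justifies "red" being greater than "green", yet the integer codes say exactly that.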

Let’s say for the workclass feature we have possible values of "Government Employee", "Private Employee", "Self-Employed", and "Self-Employed Incorporated". To encode these four possible values, we create four new features, called "Government Employee", "Private Employee", "Self-Employed", and "Self-Employed Incorporated". A feature is 1 if the workclass for this person has the corresponding value and 0 otherwise, so exactly one of the four new features will be 1 for each data point. This is why this is called one-hot or one-out-of-N encoding.

workclass	Government Employee	Private Employee	Self-Employed	Self-Employed Incorporated
Government Employee	1	0	0	0
Private Employee	0	1	0	0
Self-Employed	0	0	1	0
Self-Employed Incorporated	0	0	0	1
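The rule behind the table can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical and only meant to show the mechanism; in practice you would use a library function.

```python
# Category names taken from the workclass example above.
categories = [
    "Government Employee",
    "Private Employee",
    "Self-Employed",
    "Self-Employed Incorporated",
]

def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

# One row per category.
rows = [one_hot(c, categories) for c in categories]
for name, row in zip(categories, rows):
    print(name, row)
```

Each row contains exactly one 1, which is the "one-hot" property described above.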

The one-hot encoding we use is similar to the dummy encoding used in statistics, but not the same. We make a separate binary feature for each category. In statistics, it is common to turn a category with k possible values into k-1 features (the last one is represented as all zeros). This simplifies the analysis and avoids a rank-deficient data matrix: with all k columns plus an intercept, the columns are linearly dependent.
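The difference between k and k-1 features can be sketched with pandas. The toy DataFrame `df` is an assumption for illustration. Note that the `drop_first=True` option of `get_dummies` drops the first category rather than the last, so here it is the first category that ends up as all zeros.

```python
import pandas as pd

# Hypothetical toy frame reusing the workclass values from the example.
df = pd.DataFrame({"workclass": [
    "Government Employee", "Private Employee",
    "Self-Employed", "Self-Employed Incorporated",
]})

full = pd.get_dummies(df, dtype=int)                      # k = 4 columns
reduced = pd.get_dummies(df, drop_first=True, dtype=int)  # k - 1 = 3 columns

print(full.shape[1], reduced.shape[1])  # 4 3
print(reduced)
```

In `reduced`, the "Government Employee" row has no 1 in any column; the category is implied by the absence of all the others.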

There are two ways to convert your data to a one-hot encoding of categorical variables, using either pandas or scikit-learn.

import pandas as pd

# `data` is assumed to be an existing DataFrame with categorical (string) columns
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

Summary

One-hot encoding is used to represent categorical data numerically, as machine learning models require numerical input. This method creates a separate binary feature for each category, avoiding the artificial ordering that arises from assigning arbitrary numbers to categories. For example, a "workclass" feature with categories like "Government Employee" or "Private Employee" is transformed into multiple binary features. In contrast to the traditional dummy encoding used in statistics, which uses k-1 features for k categories, one-hot encoding uses all k features. Both pandas and scikit-learn can be used to perform one-hot encoding.



Aaryan meena