Everything You Need to Know About Categorical Features in Your Dataset

Categorical variables are an essential part of data analysis and machine learning models. They are a type of variable that takes on a limited number of values, which represent various characteristics of the data. In this article, we will discuss the types of categorical variables, their importance in machine learning, and how to handle them in different scenarios.

Difference between Categorical and Numerical Variables

Categorical variables take one of a fixed set of labels, usually represented as text, symbols, or codes, and those labels carry no numerical magnitude. Numerical variables, on the other hand, are measured on a continuous or discrete scale, so arithmetic on them (sums, averages, differences) is meaningful.
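
As a quick illustration of the distinction, here is a minimal sketch that splits the columns of a DataFrame into categorical and numerical ones by dtype. The column names and values are hypothetical sample data, and treating every non-numeric column as categorical is a simplifying assumption that will not hold for every dataset (dates and free text, for instance).

```python
import pandas as pd

# Hypothetical sample data: two categorical columns and two numerical ones
df = pd.DataFrame({
    "country": ["USA", "UK", "Canada"],
    "education": ["Masters", "High School", "Associate"],
    "age": [34, 27, 45],
    "income": [72000.0, 48000.0, 61000.0],
})

# Numerical columns have an integer or float dtype
numerical_cols = df.select_dtypes(include="number").columns.tolist()

# Assumption: everything that is not numeric is treated as categorical
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Categorical:", categorical_cols)
print("Numerical:", numerical_cols)
```

Keep in mind that dtype alone can mislead: a column of ZIP codes stored as integers is still categorical in meaning.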

Types of Categorical Variables

Categorical variables can be classified into two types: nominal and ordinal variables.

Nominal Variables: Nominal variables have two or more categories with no order or hierarchy among them, for example gender, country, or favourite color. Because the categories cannot be ranked, they can only be summarized by counting how often each one occurs.

Ordinal Variables: Ordinal variables have categories with an inherent order or hierarchy, for example educational level, socioeconomic status, or customer satisfaction level. Their categories can be ranked, although the gaps between consecutive ranks are not necessarily equal.
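
The nominal versus ordinal distinction can be made explicit in pandas. The sketch below is one possible way to do it, using an ordered categorical dtype for the ordinal variable. The education levels and country values are made-up sample data.

```python
import pandas as pd

# Ordinal: the order of the levels is declared explicitly, lowest first
levels = ["High School", "Associate", "Bachelor's", "Masters"]
education = pd.Series(
    ["Masters", "High School", "Bachelor's", "Associate"],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# Nominal: a plain (unordered) categorical
country = pd.Series(["USA", "UK", "Canada"], dtype="category")

# Ordered categoricals support min/max and comparisons
print(education.min(), "->", education.max())
at_least_bachelors = (education >= "Bachelor's").tolist()
print(at_least_bachelors)

# Unordered categoricals do not carry any ranking
print(country.cat.ordered)
```

Comparisons such as `>=` only make sense for the ordered series; pandas raises a TypeError if you try them on an unordered categorical.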

Importance of Categorical Variables

Categorical variables are essential in machine learning models as they provide valuable information about the data that numerical variables cannot. They can help identify patterns and relationships in the data that would be difficult to detect otherwise.

Categorical variables are also useful for creating features that predict outcomes in a machine learning model. For example, demographic information such as age group, gender, and income bracket can be used to predict the likelihood of a customer purchasing a particular product.

Handling Categorical Variables

Before we dive in, it is important to know why we need to convert categorical features into numerical ones. The reason is simple: most machine learning algorithms operate on numbers and cannot work with raw text labels directly. Converting categorical variables to numerical form lets the model use the information they carry. Let's look at the different ways to convert categorical features to numerical ones:

One-Hot Encoding: One-Hot Encoding is a popular technique for encoding categorical variables that have no order or hierarchy among the categories. In this technique, each category is transformed into a binary variable. For example, consider a categorical variable "Country" with three categories: USA, UK, and Canada. The one-hot encoding technique would transform this variable into three binary variables: "Is USA," "Is UK," and "Is Canada." Each binary variable takes a value of 1 if the observation belongs to that category and 0 otherwise.
```
 import pandas as pd
 from sklearn.preprocessing import OneHotEncoder

 # Sample data
 data = pd.DataFrame({'Country': ['USA', 'UK', 'Canada', 'USA', 'Canada']})

 # One-Hot Encoding
 encoder = OneHotEncoder()
 encoded_data = encoder.fit_transform(data[['Country']]).toarray()
 # get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
 one_hot_encoded_data = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Country']))

 print(one_hot_encoded_data)
```
Output:
```
    Country_Canada  Country_UK  Country_USA
 0             0.0         0.0          1.0
 1             0.0         1.0          0.0
 2             1.0         0.0          0.0
 3             0.0         0.0          1.0
 4             1.0         0.0          0.0
```
As you can see, one-hot encoding has transformed the "Country" variable into three binary variables: "Country_Canada," "Country_UK," and "Country_USA." The first observation belongs to USA, so "Country_USA" is 1 and the other two are 0. This type of encoding is useful when the categorical variable has no inherent order or ranking. Keep in mind that it adds one column per category, so variables with many distinct values can produce very wide datasets.
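
Two practical variations are worth sketching here: the pandas shortcut `pd.get_dummies`, and the `handle_unknown="ignore"` option of `OneHotEncoder`, which controls what happens when new data contains a category that was not present during fitting. The "France" row below is a hypothetical unseen category added only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Country": ["USA", "UK", "Canada", "USA", "Canada"]})
new = pd.DataFrame({"Country": ["UK", "France"]})  # "France" was not seen during fit

# pandas shortcut: one 0/1 column per category
dummies = pd.get_dummies(train["Country"], prefix="Country", dtype=int)
print(dummies)

# scikit-learn: fit on training data, then reuse the encoder on new data.
# With handle_unknown="ignore", an unseen category becomes a row of zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Country"]])
encoded_new = encoder.transform(new[["Country"]]).toarray()
print(encoded_new)
```

The fitted encoder is usually the safer choice inside a modelling pipeline, because it guarantees the same columns at training and prediction time, which `get_dummies` does not.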

Label Encoding: Label Encoding assigns each category an integer code. Note that scikit-learn's LabelEncoder numbers the categories in sorted (alphabetical) order rather than by any meaningful ranking, and it is designed primarily for encoding target labels. For example, consider a categorical variable "Education" with four categories: High School, Associate, Bachelor's, and Masters. LabelEncoder sorts them as Associate, Bachelor's, High School, Masters and assigns 0, 1, 2, and 3, respectively.
```
 import pandas as pd
 from sklearn.preprocessing import LabelEncoder

 # Sample data
 data = pd.DataFrame({'Education': ['High School', 'Associate', 'Bachelor\'s', 'Masters', 'High School']})

 # Label Encoding
 encoder = LabelEncoder()
 data['Education_LabelEncoded'] = encoder.fit_transform(data['Education'])

 print(data)
```
Output:
```
      Education  Education_LabelEncoded
 0  High School                       2
 1    Associate                       0
 2   Bachelor's                       1
 3      Masters                       3
 4  High School                       2
```
As you can see, label encoding has transformed the "Education" variable into four integer codes, assigned alphabetically: "Associate" is 0, "Bachelor's" is 1, "High School" is 2, and "Masters" is 3. These codes do not follow the real ranking of education levels, since "High School" receives a higher code than "Bachelor's". Many algorithms treat the encoded values as ordered quantities, so for features with a genuine order it is safer to define the mapping explicitly, as in ordinal encoding.
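
To check which integer each category received, you can inspect the fitted encoder. The following is a small sketch using the same education values; the variable names `mapping` and `decoded` are introduced here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

values = ["High School", "Associate", "Bachelor's", "Masters", "High School"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)
```

Printing the mapping makes it obvious that the codes are alphabetical, which is easy to overlook when reading only the encoded column.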

Ordinal Encoding: Ordinal Encoding is used for categorical variables that have a natural order or hierarchy among the categories. Each category is assigned an integer chosen explicitly so that the numbers preserve the relative order of the categories. For example, consider a categorical variable "Satisfaction" with five categories: Very Unsatisfied, Unsatisfied, Neutral, Satisfied, and Very Satisfied. Ordinal encoding would transform this variable into five numerical values: 1, 2, 3, 4, and 5, respectively.
```
 import pandas as pd

 # Sample data
 data = pd.DataFrame({'Satisfaction': ['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']})

 # Ordinal Encoding
 satisfaction_mapping = {'Very Unsatisfied': 1, 'Unsatisfied': 2, 'Neutral': 3, 'Satisfied': 4, 'Very Satisfied': 5}
 data['Satisfaction_Ordinal_Encoded'] = data['Satisfaction'].map(satisfaction_mapping)

 print(data)
```
Output:
```
        Satisfaction  Satisfaction_Ordinal_Encoded
 0  Very Unsatisfied                             1
 1       Unsatisfied                             2
 2           Neutral                             3
 3         Satisfied                             4
 4    Very Satisfied                             5
```
As you can see, ordinal encoding has transformed the "Satisfaction" variable into numerical values from 1 to 5 that follow the order of the categories from "Very Unsatisfied" to "Very Satisfied". This type of encoding is useful when the categorical variable has an inherent order or ranking, as satisfaction levels do.
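
The same result can be obtained with scikit-learn's `OrdinalEncoder` by passing the category order explicitly. This is a sketch of an alternative to the dictionary mapping, not a replacement for it. Note that `OrdinalEncoder` starts counting at 0, so 1 is added here only to match the 1 to 5 scale used above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The order is stated explicitly, lowest first
order = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]

data = pd.DataFrame({
    "Satisfaction": ["Neutral", "Very Satisfied", "Unsatisfied", "Very Unsatisfied", "Satisfied"]
})

# categories takes one list per encoded column
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(data[["Satisfaction"]]).ravel().astype(int)

# OrdinalEncoder codes run from 0 to 4; shift to 1 to 5
data["Satisfaction_Ordinal_Encoded"] = codes + 1

print(data)
```

Without the `categories` argument, `OrdinalEncoder` falls back to sorted order, which would scramble the ranking just as `LabelEncoder` does.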

Target Encoding: Target Encoding is typically used for categorical variables that have no order or hierarchy among the categories, especially those with many distinct values. In this technique, each category is replaced with the mean of the target variable for that category. For example, consider a categorical variable "City" with five categories: New York, Los Angeles, Chicago, Houston, and Miami, and a target variable "Sales." Target encoding would transform the "City" variable into five numerical values derived from the mean sales for each category.

```
import pandas as pd
import category_encoders as ce

# Sample data
data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
                     'Sales': [10000, 8000, 5000, 7000, 9000]})

# Target Encoding
target_encoder = ce.TargetEncoder(cols=['City'])
data = target_encoder.fit_transform(data, data['Sales'])

print(data)
```

Output (illustrative; the exact values depend on the library version and its smoothing settings):
```
       City  Sales
0  8200.000  10000
1  7250.000   8000
2  6000.000   5000
3  7000.000   7000
4  9000.000   9000
```

As you can see, target encoding has replaced each "City" value with a number derived from the sales for that category. Note that the encoded values are not the raw per-category means: category_encoders blends each category's mean with the global mean (7800 here), and the blend leans more heavily on the global mean when a category has few rows. With only one row per city, the exact numbers depend on the installed library version and its smoothing settings, so your output may differ from the listing above. Target encoding can be useful when the categorical variable is strongly related to the target, but the encoder should be fitted on training data only, otherwise information about the target leaks into the features.
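
To make the idea of blending a category mean with the global mean concrete, here is a minimal hand-rolled sketch using pandas only. It uses simple additive smoothing with a hypothetical weight `m`. This is an assumption chosen for readability and is not the exact formula that category_encoders applies. The sample data is also made up, with repeated cities so that per-category means are meaningful.

```python
import pandas as pd

# Hypothetical sample data with several rows per city
data = pd.DataFrame({
    "City": ["New York", "New York", "Chicago", "Chicago", "Chicago", "Miami"],
    "Sales": [10000, 8000, 5000, 6000, 7000, 9000],
})

global_mean = data["Sales"].mean()

# Per-category mean and row count
stats = data.groupby("City")["Sales"].agg(["mean", "count"])

# Additive smoothing: behaves as if each category had m extra rows at the
# global mean. m is a hypothetical tuning value, not a library default.
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Replace each city with its smoothed target mean
data["City_TargetEncoded"] = data["City"].map(smoothed)

print(smoothed)
print(data)
```

Categories with few rows (Miami here) are pulled furthest toward the global mean, which is the point of smoothing: a mean computed from a single row is too noisy to trust on its own.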

Conclusion

Categorical variables are an essential component of data analysis and machine learning models. They provide valuable information that numerical variables cannot, and they are useful in creating features that can be used to predict outcomes in a machine learning model. Handling categorical variables requires careful consideration of their type and the available encoding techniques. One-hot encoding, label encoding, ordinal encoding, and target encoding are some of the popular encoding techniques used to handle categorical variables. Hope you got some value out of this article. Subscribe to the newsletter to get more such blogs.

Thanks :)