Everything you need to know about Categorical features in your dataset
Categorical variables are an essential part of data analysis and machine learning models. They are a type of variable that takes on a limited number of values, which represent various characteristics of the data. In this article, we will discuss the types of categorical variables, their importance in machine learning, and how to handle them in different scenarios.
Difference between Categorical and Numerical Variables
Categorical variables take their values from a fixed set of labels rather than from measured quantities, so arithmetic on them is not meaningful. They are often represented by text, symbols, or category codes. Numerical variables, on the other hand, are measured on a continuous or discrete scale and are represented by numbers on which arithmetic makes sense.
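As a rough illustration of the distinction, the sketch below splits a small DataFrame into numerical and non-numerical columns using pandas. The column names and values are made up for this example, and the split relies only on how each column is stored, which is an assumption that does not always hold.

```python
import pandas as pd

# Hypothetical sample data mixing both kinds of variables
df = pd.DataFrame({
    "country": ["USA", "UK", "Canada"],     # categorical (text labels)
    "age": [34, 27, 45],                     # numerical (discrete)
    "income": [52000.0, 61000.5, 48000.0],   # numerical (continuous)
})

# Columns stored as integers or floats
numerical_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else; in this toy frame that is only the text column
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Keep in mind that the storage type is only a hint: a column of postal codes or product IDs is stored as numbers but is still categorical.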
Types of Categorical Variables
Categorical variables can be classified into two types: nominal and ordinal variables.
Nominal Variables: Nominal variables are variables that have two or more categories, but there is no order or hierarchy among them. For example, gender, country, or favourite color. Because the categories cannot be ranked, the only meaningful summaries of a nominal variable are counts or frequencies per category.
Ordinal Variables: Ordinal variables are variables that have categories with an inherent order or hierarchy among them. For example, educational level, socioeconomic status, or customer satisfaction levels. The categories of an ordinal variable can be ranked and compared, although the distance between two neighbouring categories is not defined.
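One way to make this distinction explicit in code is the pandas `Categorical` type, which can be declared with or without an order. The snippet below is a minimal sketch; the sample values are made up for illustration.

```python
import pandas as pd

# Nominal: categories with no order among them
colors = pd.Categorical(["red", "blue", "green", "blue"])

# Ordinal: categories listed explicitly from lowest to highest
education = pd.Categorical(
    ["Bachelor's", "High School", "Masters", "Associate"],
    categories=["High School", "Associate", "Bachelor's", "Masters"],
    ordered=True,
)

print(colors.ordered)     # whether an order is defined for colors
print(education.ordered)  # whether an order is defined for education
print(education.codes)    # position of each value in `categories`
print(education.min(), education.max())  # only meaningful because ordered=True
```

Declaring the order up front means comparisons, sorting, and `min`/`max` follow the intended ranking instead of alphabetical order.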
Importance of Categorical Variables
Categorical variables are essential in machine learning models as they provide valuable information about the data that numerical variables cannot. They can help identify patterns and relationships in the data that would be difficult to detect otherwise.
Categorical variables are also useful in creating features that can be used to predict outcomes in a machine learning model. For example, using demographic information such as age group, gender, and income bracket to predict the likelihood of a customer purchasing a particular product.
Handling Categorical Variables
Before we dive in, it is important to know why we need to convert categorical features into numerical features. The reason is simple: most machine learning algorithms operate on numbers and cannot work directly with text labels. Converting categorical variables to numerical ones lets a model use the information they carry. Let's look at the different ways to convert categorical features to numerical ones:
One-Hot Encoding: One-Hot Encoding is a popular technique for encoding categorical variables that have no order or hierarchy among the categories. In this technique, each category is transformed into a binary variable. For example, consider a categorical variable "Country" with three categories: USA, UK, and Canada. The one-hot encoding technique would transform this variable into three binary variables: "Is USA," "Is UK," and "Is Canada." Each binary variable takes a value of 1 if the observation belongs to that category and 0 otherwise.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Sample data
data = pd.DataFrame({'Country': ['USA', 'UK', 'Canada', 'USA', 'Canada']})
# One-Hot Encoding
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix, so convert it to a dense array
encoded_data = encoder.fit_transform(data[['Country']]).toarray()
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
one_hot_encoded_data = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Country']))
print(one_hot_encoded_data)
Output:
   Country_Canada  Country_UK  Country_USA
0             0.0         0.0          1.0
1             0.0         1.0          0.0
2             1.0         0.0          0.0
3             0.0         0.0          1.0
4             1.0         0.0          0.0
As you can see, the one-hot encoding technique has transformed the "Country" variable into three binary variables: "Country_Canada," "Country_UK," and "Country_USA." Each binary variable takes a value of 1 if the observation belongs to that category and 0 otherwise. For example, the first observation belongs to USA, so "Country_USA" is 1 and the other two binary variables are 0. This type of encoding can be useful when the categorical variable does not have an inherent order or ranking, and each category is equally important.
Label Encoding: Label Encoding assigns each category an integer code. scikit-learn's LabelEncoder assigns these codes by sorting the labels, so the numbers do not follow any natural order among the categories; the class is intended mainly for encoding target labels rather than input features. For example, consider a categorical variable "Education" with four categories: High School, Associate, Bachelor's, and Masters. Sorted alphabetically, these become Associate = 0, Bachelor's = 1, High School = 2, and Masters = 3.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample data
data = pd.DataFrame({'Education': ['High School', 'Associate', "Bachelor's", 'Masters', 'High School']})
# Label Encoding
encoder = LabelEncoder()
data['Education_LabelEncoded'] = encoder.fit_transform(data['Education'])
print(data)
Output:
     Education  Education_LabelEncoded
0  High School                       2
1    Associate                       0
2   Bachelor's                       1
3      Masters                       3
4  High School                       2
As you can see, the label encoding technique has transformed the "Education" variable into four numerical values: 0, 1, 2, and 3. The codes follow the alphabetical order of the labels, not the order of the education levels: "Associate" is assigned the value 0, "Bachelor's" the value 1, "High School" the value 2, and "Masters" the value 3. This type of encoding is compact, but it's important to note that many algorithms will interpret the encoded values as having an ordered relationship, which is misleading here because the codes do not match the real ranking. When the order matters, specify it explicitly with ordinal encoding instead.
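To check which integer a fitted `LabelEncoder` assigned to each label, you can inspect its `classes_` attribute, and `inverse_transform` maps codes back to the original labels. A minimal sketch, using the same education labels as above:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["High School", "Associate", "Bachelor's", "Masters", "High School"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the labels in sorted order; a label's index is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
print(list(encoder.inverse_transform(codes)))
```

Printing the mapping makes it easy to spot cases where the sorted order of the labels differs from the ranking you intended.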
Ordinal Encoding: Ordinal Encoding is used to encode categorical variables that have a natural order or hierarchy among the categories. In this technique, the mapping from category to number is specified explicitly, so that the numerical values preserve the relative order of the categories. For example, consider a categorical variable "Satisfaction" with five categories: Very Unsatisfied, Unsatisfied, Neutral, Satisfied, and Very Satisfied. The ordinal encoding technique would transform this variable into five numerical values: 1, 2, 3, 4, and 5, respectively.
import pandas as pd
# Sample data
data = pd.DataFrame({'Satisfaction': ['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']})
# Ordinal Encoding: map each category to its rank, lowest to highest
satisfaction_mapping = {'Very Unsatisfied': 1, 'Unsatisfied': 2, 'Neutral': 3, 'Satisfied': 4, 'Very Satisfied': 5}
data['Satisfaction_Ordinal_Encoded'] = data['Satisfaction'].map(satisfaction_mapping)
print(data)
Output:
       Satisfaction  Satisfaction_Ordinal_Encoded
0  Very Unsatisfied                             1
1       Unsatisfied                             2
2           Neutral                             3
3         Satisfied                             4
4    Very Satisfied                             5
As you can see, the ordinal encoding technique has transformed the "Satisfaction" variable into numerical values ranging from 1 to 5, based on the order of the categories from "Very Unsatisfied" to "Very Satisfied". This type of encoding is useful when the categorical variable has an inherent order or ranking, as in the case of satisfaction levels.
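The same idea can also be expressed with scikit-learn's `OrdinalEncoder` by passing the category order explicitly. This is a sketch of an alternative to the dictionary mapping above, not a replacement for it; note that the resulting codes are floats and start at 0 rather than 1. The row order of the sample data is shuffled here on purpose.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data in arbitrary row order
data = pd.DataFrame({
    "Satisfaction": ["Neutral", "Very Satisfied", "Unsatisfied", "Very Unsatisfied", "Satisfied"]
})

# Explicit order from lowest to highest; without it, categories are sorted alphabetically
order = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]
encoder = OrdinalEncoder(categories=[order])

# fit_transform returns a 2D array with one column, so flatten it
data["Satisfaction_Ordinal_Encoded"] = encoder.fit_transform(data[["Satisfaction"]]).ravel()
print(data)
```

An encoder object can be fitted once and reused on new data, which is convenient inside a scikit-learn pipeline.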
Target Encoding: Target Encoding is used to encode categorical variables that have no order or hierarchy among the categories. In this technique, each category is replaced with the mean of the target variable for that category. For example, consider a categorical variable "City" with five categories: New York, Los Angeles, Chicago, Houston, and Miami, and a target variable "Sales." The target encoding technique would transform the "City" variable into five numerical values representing the mean sales for each category.
import pandas as pd
import category_encoders as ce
# Sample data
data = pd.DataFrame({'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
'Sales': [10000, 8000, 5000, 7000, 9000]})
# Target Encoding
target_encoder = ce.TargetEncoder(cols=['City'])
data = target_encoder.fit_transform(data, data['Sales'])
print(data)
Output:
City Sales
0 8200.000 10000
1 7250.000 8000
2 6000.000 5000
3 7000.000 7000
4 9000.000 9000
As you can see, the target encoding technique has replaced each "City" value with a number derived from the target variable. Note that the encoded values are not simply the raw per-city means: each city appears only once in this sample, so the raw mean for New York is its single sales figure of 10000, while the encoded value shown is 8200. The TargetEncoder in category_encoders applies smoothing, blending each category's mean with the overall mean of the target (7800 in this sample), and the exact numbers depend on the library version and its smoothing settings, so treat the output above as illustrative. This type of encoding can be useful when the categorical variable has a strong correlation with the target variable and can provide additional information to the model.
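To make the mechanics visible, here is a minimal hand-rolled sketch of target encoding using only pandas. It computes the raw per-city mean and a smoothed version that pulls small categories toward the global mean. The sample data, the smoothing weight `m`, and the simple additive smoothing formula are all assumptions chosen for illustration; this is not the formula that category_encoders uses internally.

```python
import pandas as pd

# Hypothetical sample data with repeated cities
data = pd.DataFrame({
    "City": ["New York", "New York", "Chicago", "Chicago", "Chicago", "Miami"],
    "Sales": [10000, 12000, 5000, 6000, 4000, 9000],
})

global_mean = data["Sales"].mean()
stats = data.groupby("City")["Sales"].agg(["mean", "count"])

# Additive smoothing: behaves like adding m pseudo-observations at the global mean
m = 2  # hypothetical smoothing weight
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

data["City_mean"] = data["City"].map(stats["mean"])    # raw target encoding
data["City_smoothed"] = data["City"].map(smoothed)     # smoothed target encoding
print(data)
```

Because the encoding is computed from the target, it should be derived from the training data only and then applied to validation or test data; otherwise information about the target leaks into the features.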
Conclusion
Categorical variables are an essential component of data analysis and machine learning models. They provide valuable information that numerical variables cannot, and they are useful in creating features that can be used to predict outcomes in a machine learning model. Handling categorical variables requires careful consideration of their type and the available encoding techniques. One-hot encoding, label encoding, ordinal encoding, and target encoding are some of the popular encoding techniques used to handle categorical variables. Hope you got some value out of this article.
Thanks :)
Written by
Rhythm Rawat
Machine learning enthusiast with a strong emphasis on computer vision and deep learning. Skilled in using well-known machine learning frameworks like TensorFlow, scikit-learn, and PyTorch for effective model development. Familiar with transfer learning and fine-tuning pre-trained models to achieve better results with limited data. Proficient in Python, Machine Learning, SQL, and Flask.