How many features are enough features?

Table of contents
- Does the feature intuitively make sense in influencing the target variable?
- Does the feature have a strong correlation with the predicted variable?
- What is the data quality?
- Can you merge some features?
- Are the features easily interpretable in the context of the problem domain?
- Not Too Many, Not Too Few, Just Right: "Lagom"

Let me just give the answer away in the first sentence: there is no magic number of features that counts as "enough." I know it's not what you wanted to hear, but stay with me for a while... I'll answer this question with questions to give you clarity.
Feature selection involves choosing a subset of relevant features from the original feature set; doing so can significantly improve model performance, reduce computational cost, and make the model easier to interpret.
This was a big problem for me because I thought that the more data you have, the better your model's predictive capabilities....
But no! Not every time!
The more useless features (and outlier-ridden data) you add, the farther you drift from the result you're actually after.
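To make feature selection concrete, here is a minimal sketch of the idea: rank candidate features by their absolute correlation with the target and keep only the top k. The DataFrame, column names, and toy values are all illustrative assumptions, not from any real dataset.

```python
# Toy feature-selection sketch: keep the k features most correlated
# with the target (all names and values are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "useful_1": [1, 2, 3, 4, 5, 6],
    "useful_2": [2, 1, 4, 3, 6, 5],
    "noise":    [9, 1, 7, 2, 8, 3],
    "target":   [1, 2, 3, 4, 5, 6],
})

k = 2
# Absolute Pearson correlation of every candidate feature with the target
scores = df.drop(columns="target").corrwith(df["target"]).abs()
selected = scores.nlargest(k).index.tolist()
print(selected)  # the noisy column is dropped
```

Real pipelines would use cross-validated scores or a library helper rather than raw correlation, but the principle is the same: measure relevance, then prune.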
Does the feature intuitively make sense in influencing the target variable?
First, trust your judgment about what is being predicted, and make a long list of all the features you feel would affect the accuracy of your prediction.
Let's say you're building a model to predict customer churn for a telecommunications company.
Relevant features:
- Monthly usage (data, minutes, texts)
- Customer tenure
- Number of customer service calls
- Contract type
- Payment history
- Demographic information (age, location)

Irrelevant features:
- Customer's favorite color (unlikely to have a significant impact on churn)
- Customer's social security number (irrelevant and potentially sensitive)
Does the feature have a strong correlation with the predicted variable?
While making this list, keep in mind that the feature should have a statistically significant relationship with the target variable. This can be assessed using correlation coefficients (for numerical features), chi-squared tests (for categorical features), or other statistical tests.
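A rough sketch of both checks, sticking with the churn example. The DataFrame and its values are made up for illustration; the column names are assumptions.

```python
# Screening two candidate features against a binary churn target
# (toy data; column names are hypothetical).
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "monthly_usage": [10, 55, 30, 80, 5, 60],
    "contract_type": ["monthly", "yearly", "monthly", "yearly", "monthly", "yearly"],
    "churn": [1, 0, 1, 0, 1, 0],
})

# Numerical feature: Pearson correlation with the target
corr = df["monthly_usage"].corr(df["churn"])
print(f"correlation(monthly_usage, churn) = {corr:.2f}")

# Categorical feature: chi-squared test of independence
table = pd.crosstab(df["contract_type"], df["churn"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")
```

A strongly negative correlation here would suggest low-usage customers churn more; a small p-value would suggest contract type and churn are not independent. With a sample this tiny the numbers are only illustrative.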
What is the data quality?
Features with a high percentage of missing values or many outliers can be problematic. In this case, you can consider imputation methods, or remove the feature entirely if the missing values are substantial.
Ensure the feature's data type is appropriate for the chosen machine learning model (e.g., numerical, categorical, boolean).
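One way to sketch this audit, assuming a pandas DataFrame (the columns, the 40% threshold, and median imputation are all illustrative choices, not a prescription):

```python
# Quick data-quality audit: measure missingness per feature, then
# impute sparse-but-usable columns and drop hopeless ones.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tenure_months": [12, 24, np.nan, 36, np.nan, 48],
    "age": [25, 40, 31, np.nan, 52, 47],
})

# Share of missing values per feature
missing_ratio = df.isna().mean()
print(missing_ratio)

THRESHOLD = 0.4  # arbitrary cutoff for this sketch
for col in list(df.columns):
    if missing_ratio[col] < THRESHOLD:
        # Usable: fill the gaps with the column median
        df[col] = df[col].fillna(df[col].median())
    else:
        # Too sparse: remove the feature entirely
        df = df.drop(columns=col)
print(df)
```

The right threshold and imputation strategy (median, mean, model-based) depend on the feature and the model, so treat the numbers above as placeholders.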
Can you merge some features?
In some cases, several features can be merged into one to reduce training time and create simpler models that are easier to understand and explain, e.g., creating a "price per square foot" feature from "price" and "square footage".
This process is called feature engineering, and it further involves manipulating and transforming raw data into features that can be used to effectively train machine learning models.
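The "price per square foot" example can be sketched in a few lines (the DataFrame and its values are made up):

```python
# Merging two raw columns into one engineered feature.
import pandas as pd

houses = pd.DataFrame({
    "price": [300_000, 450_000, 250_000],
    "square_footage": [1500, 2000, 1250],
})

houses["price_per_sqft"] = houses["price"] / houses["square_footage"]
# Drop the raw columns so the model sees one feature instead of two.
houses = houses.drop(columns=["price", "square_footage"])
print(houses)
```

Whether to keep or drop the raw columns after engineering is a judgment call; dropping them gives a smaller, simpler model, while keeping them lets the model use both signals.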
Are the features easily interpretable in the context of the problem domain?
Simplicity is key.
How easy it is to understand and explain the reasoning behind a model's predictions depends heavily on your stakeholders (those you are doing the prediction/forecast for).
This problem usually comes up after complex combinations of features are generated through feature engineering. While these features might improve model accuracy, they may be difficult to understand and explain to stakeholders.
Not Too Many, Not Too Few, Just Right: "Lagom"
So to answer your question, “How many features are enough features?” I would say not too many, not too few, just right.
Think of your data like a recipe. Too many ingredients can make a mess of a delicious dish! Feature selection is like carefully choosing the right spices. You want the flavors to complement each other, not overwhelm the dish. Ask yourself: Does this ingredient (feature) truly add something special? Does it play well with the others? And most importantly, can I easily explain this culinary masterpiece to my dinner guests (stakeholders)?