On the value of preprocessing and feature engineering

Edward Tian
5 min read

While I worked in the data science department of a large insurance carrier, the VP of Data Science once said to me, "Everyone has the same math."

The intent of this statement was to highlight that everyone gets distracted by sexy data science models. Neural networks with attention mechanisms, gradient-boosted tree-based algorithms, and all of the other buzzwords can make you feel invincible. However, the true value of data science, the thing that sets your company apart from other companies, is rarely the model itself.

Any person with a week of coding experience can import a library and unleash a model on a dataset. What makes or breaks a model's performance is the underlying data. This is why this VP's first two years of tenure were spent going over everything, from contracts with third parties to the software we utilized to our workflow processes, to make sure that we were generating all of the data we needed to get work done, that we got to keep the data we generated, and that the data's provenance remained intact.

In addition, even if two companies have the same data, there are two avenues to gaining a competitive advantage in terms of model performance:

The obvious first option is to use a "better" model, or a model that is more flexible with respect to its inputs. Neural networks are a classic example of this: they can parse information and learn patterns that are difficult to capture with something like a generalized linear model. The downside to reaching for a "better" model is that there's always a tension between explainability and performance. Chances are, by moving to a more complex model, your understanding of the problem and of how the model works becomes more and more of a black box.

The second option, however, is to make the data easier to learn from, from the model's perspective. This is feature engineering. Feature engineering is applying transformations to the raw data that either generate new information for the model or direct its attention toward the concept you're trying to get it to learn. I'll give a few examples from my work experience.

Geospatial Data

I was doing a project that involved geospatial analytics, and much of the data included longitude and latitude coordinates. If you had a dataset of Airbnb characteristics (one bed, two bed, square footage, etc.) along with latitudes and longitudes, and fed that data directly into common data science models to predict Airbnb pricing, you'd find that the latitudes and longitudes would not rank high on feature importance.

However, if you took those latitudes and longitudes and converted them into distances from metro areas, they would become incredibly important. Alternatively, you could use a k-nearest neighbors approach, in which the distance metric is literally the geographic distance between a test Airbnb record and the nearest training records.

Both methodologies produce higher-performing models. They also preserve the same degree of explainability (arguably, they increase it). And they don't require any additional information, only a deterministic pipeline to create the new feature.
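The distance-from-metro transformation above is simple to sketch. This is a minimal illustration, not the pipeline from my project: the coordinates and the metro-center location are made-up values, and the haversine formula stands in for whatever distance convention you prefer.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

# Hypothetical listings carrying only raw coordinates.
listings = pd.DataFrame({
    "lat": [40.73, 40.65, 40.80],
    "lon": [-73.99, -73.95, -73.96],
})

# Engineered feature: distance to a metro center (here, a point in midtown
# Manhattan). Raw lat/lon rarely helps a model; this derived column does.
METRO_LAT, METRO_LON = 40.7549, -73.9840
listings["dist_to_metro_km"] = haversine_km(
    listings["lat"], listings["lon"], METRO_LAT, METRO_LON
)
```

The model never sees the raw coordinates; it sees a column whose meaning ("how far from downtown is this listing?") lines up directly with how prices actually behave.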

Rock Property and Drilling Data

As another example, my current company takes cheap electronic drilling records and uses neural networks to convert them into expensive rock property data. However, we don't just take drilling data and rock property data, combine them, and unleash a neural network on the result. We tried that initially, and it doesn't do very well. Honestly, if it did perform well, this product would be in trouble, because anyone and their mother would be able to do the same thing. Why would they pay our company to do it for them?

Instead, there are things we do (obviously I cannot share them all) to make the learning process easier for our ML models. One example is applying a bandpass filter to our signal data to isolate only the frequencies pertinent to what we want the model to learn. A more specialized example involves something called tops data.
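A bandpass step like the one described can be sketched with SciPy. To be clear, the sampling rate, band edges, and synthetic signal below are all invented for illustration; the only point is that frequencies outside the band the model should care about are removed before training.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(signal, fs, low_hz, high_hz, order=4):
    """Zero-phase Butterworth bandpass: keep only the band of interest."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Synthetic stand-in for a drilling signal sampled at 1 kHz: a 50 Hz
# component we care about, buried under a strong 5 Hz drift and 300 Hz noise.
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)
raw = (
    np.sin(2 * np.pi * 50 * t)
    + 3.0 * np.sin(2 * np.pi * 5 * t)
    + 0.5 * np.sin(2 * np.pi * 300 * t)
)

# Keep 20-100 Hz; the drift and high-frequency noise are attenuated,
# so the model only ever sees the part of the spectrum that matters.
filtered = bandpass(raw, fs, low_hz=20.0, high_hz=100.0)
```

The zero-phase variant (`sosfiltfilt`) is worth noting: it avoids shifting features in depth/time, which matters when the filtered signal has to stay aligned with labels.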

In the energy industry, rock property and drilling data also record the start of distinct layers of earth. The points where these layers begin are called, fittingly, tops. Tops data records the depth at which each new layer of earth starts. For any given location it is a static piece of data, but it varies as you move laterally across the surface of the Earth.

In our well logging work, we can use this data to tell the model which layer we're in and how "far" we've progressed through that layer. These tops serve as landmarks or checkpoints for the model, so that it can specialize its learning within each layer. This approach yielded much more performant models for well log generation.
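The layer-and-progress features described above can be sketched as follows. The tops table, depths, and well bottom here are hypothetical, and this is one plausible encoding (a layer label plus a 0-1 "fraction traversed"), not the exact features we use.

```python
import numpy as np
import pandas as pd

# Hypothetical tops for one well: the depth (ft) at which each layer begins.
tops = pd.DataFrame({
    "layer": ["A", "B", "C"],
    "top_depth": [0.0, 1500.0, 3200.0],
})
WELL_BOTTOM = 5000.0  # assumed total depth of the well

def tops_features(depths, tops, well_bottom):
    """For each measured depth: which layer it sits in, and how far through it."""
    starts = tops["top_depth"].to_numpy()
    ends = np.append(starts[1:], well_bottom)
    # Index of the layer whose top is at or above each depth.
    idx = np.searchsorted(starts, depths, side="right") - 1
    # Fractional progress through that layer, 0 at the top, 1 at the base.
    frac = (depths - starts[idx]) / (ends[idx] - starts[idx])
    return tops["layer"].to_numpy()[idx], frac

depths = np.array([100.0, 1600.0, 4100.0])
layers, progress = tops_features(depths, tops, WELL_BOTTOM)
# layers → ['A', 'B', 'C']
```

Instead of a bare depth number, the model now gets "checkpoint" context: which layer it is in, and where within that layer it sits.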

Overall, in my experience, I've found feature engineering to be a much more feasible way to generate competitive value in a data science team, and an often underestimated or underutilized tool in a data scientist's toolkit. However, I don't think it's underutilized because people simply undervalue it. I think a large contributor to its underutilization is that feature engineering requires doing the legwork.

On Domain Knowledge

This legwork, and the last thing I want to touch on, is the value of domain knowledge. A former boss and mentor repeatedly told me that "domain knowledge is data science gold." Each of the above examples of feature engineering requires the data scientist to have a strong understanding of what the data represents. If the columns were renamed to integers and scaled between 0 and 1, you wouldn't be able to do the same feature engineering. This is why I don't like data science hackathons where companies provide anonymized, coded data: they end up being contests of who can throw the most math at the same dataset while mindlessly creating as many combinations of columns as possible in an attempt to shotgun-engineer features.

As a data scientist, I always try to become familiar with the data I am working with. Sometimes, that's by doing some research online. Other times, it requires that I schedule some time with a subject matter expert and ask them a bunch of questions that'll make me look foolish or uneducated - but it doesn't matter. Ultimately, what'll truly make me look foolish is how my project turns out.
