On Domain Knowledge Value and Geospatial Analytics


I just wanted to write down some quick thoughts about the value of domain knowledge and feature engineering.
My recent work involved using several very large data sources. One data source stored "failure to deliver" and "package not received" data from mailing sources. Another had reported incidents, such as crime or accidental fires.
We wanted to explore the predictive or explanatory power of these datasets with respect to various kinds of losses for the commercial buildings in our book. An example hypothesis might be that commercial buildings located near homes that had accidental fires are themselves more prone to fire losses - whether because of the cultural behavior of the local area, the level of fire-code enforcement, the physical risk of a fire spreading from a nearby building to the one we underwrite, and so on.
All of these datasets had location data embedded in them - sometimes as addresses, other times as latitude and longitude coordinates. The coordinate formats themselves can vary: some are expressed in degrees, minutes, and seconds; others in degrees and decimal minutes; still others in decimal degrees.
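Converting between these formats is simple arithmetic: a decimal-degree value is degrees + minutes/60 + seconds/3600. A minimal sketch in plain Python (the function names here are my own, not from any library):

```python
def dms_to_decimal(degrees: float, minutes: float = 0.0, seconds: float = 0.0) -> float:
    """Convert degrees/minutes/seconds to decimal degrees.

    The sign of `degrees` carries the hemisphere (negative for S/W).
    """
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60.0 + seconds / 3600.0)


def ddm_to_decimal(degrees: float, decimal_minutes: float = 0.0) -> float:
    """Convert degrees and decimal minutes to decimal degrees."""
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + decimal_minutes / 60.0)


# Example: 40° 26' 46" N, 79° 58' 56" W
lat = dms_to_decimal(40, 26, 46)   # ≈ 40.44611
lon = dms_to_decimal(-79, 58, 56)  # ≈ -79.98222
```

Keeping the hemisphere sign on the degrees value (rather than a separate N/S/E/W flag) matches how decimal-degree data is usually stored downstream.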
In addition, these coordinates can be expressed against different geodetic datums, the reference frames underlying coordinate reference systems. Datums use different points of reference to determine their coordinate values. For example, WGS84 is a global geodetic reference system based on a mathematical (ellipsoidal) model of the earth's shape, while NAD83 is a North American reference system originally realized through a network of physical survey markers. This means that for the exact same physical location, the latitude and longitude values can differ depending on which datum is being used.
Thankfully, I discovered a delightful Python package called geopandas. It let me work through the data and standardize all coordinates into one unified set of units under one unified datum. This required some trial and error on my part - some data sources did not record which coordinate reference system their coordinates used, but by exploiting common information (e.g., addresses for properties that overlap with our book of insurance), I could plot the points and determine which reference system each source was using.
I did not magically have all of this knowledge about geospatial systems when I started this project. I took the initiative to do some learning on my own, as well as to schedule some time with subject matter experts within my company. My manager repeatedly says that having the domain knowledge to better understand the data you're working with is "data science gold."
In this case, had I not spent that time understanding the data and how it can vary, almost any task I attempted - descriptive, predictive, or prescriptive analytics - would have been built on mixed-up data, leading to the conclusion that the data is useless even though it does in fact provide value to the business.
Written by Edward Tian
