At the end of my econometrics and time series analysis course, we were assigned a project: analyze a data set to understand the effect of drunk driving laws on traffic deaths. Honestly, this was the most eye-opening and educational part of the course, mainly because much of the material presented in class was content I had already been exposed to as an undergraduate at Rice (although the review was very welcome, since before this course much of my foundational econometrics knowledge was powered by recognition and not recall).

Data Set

The data given to us was structured as panel data, that is, data for several unique entities over a period of time. In this case, each entity was a state (for a total of 48 states, since the data excluded Alaska and Hawaii), and the period of time was 1982 through 1988, at an annual interval. For each entity and each time period, we were given the variables below, complete with their data definitions:

Variable	Description
state	State ID (FIPS) Code
year	Year
spircons	Per Capita Pure Alcohol Consumption (Annual, Gallons)
unrate	State Unemployment Rate (%)
perinc	Per Capita Personal Income ($)
beertax	Tax on Case of Beer ($)
sobapt	% Southern Baptist
mormon	% Mormon
mlda	Minimum Legal Drinking Age (years)
dry	% Residing in Dry Counties (a dry county is a county whose government forbids the sale of any kind of alcoholic beverages; some prohibit off-premises sale, some prohibit on-premises sale, and some prohibit both)
yngdrv	% of Drivers Aged 15-24
vmiles	Avg. Miles per Driver
jaild	Mandatory Jail Sentence
comserd	Mandatory Community Service
allmort	# of Vehicle Fatalities (#VF)
mrall	Vehicle Fatality Rate (VFR) – # deaths in given state in given year per 10k ppl living in that state that year
allnite	# of Night-time VF (#NVF)
mralln	Night-time VFR (NVFR)
allsvn	# of Single-Vehicle Fatalities (#SVN)
a1517	#VF, 15-17 year olds
mra1517	VFR, 15-17 year olds
a1517n	#NVF, 15-17 year olds
mra1517n	NVFR, 15-17 year olds
a1820	#VF, 18-20 year olds
a1820n	#NVF, 18-20 year olds
mra1820	VFR, 18-20 year olds
mra1820n	NVFR, 18-20 year olds
a2124	#VF, 21-24 year olds
mra2124	VFR, 21-24 year olds
a2124n	#NVF, 21-24 year olds
mra2124n	NVFR, 21-24 year olds
aidall	# of alcohol-involved VF
mraidall	Alcohol-Involved VFR
pop	Population
pop1517	Population, 15-17 year olds
pop1820	Population, 18-20 year olds
pop2124	Population, 21-24 year olds
miles	Total Vehicle Miles (millions)
gspch	GSP Rate of Change (a measure of economic growth)
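
To make the panel structure concrete, here is a minimal Python sketch that checks whether a panel is balanced (every state observed once in every year) and recomputes a fatality rate per 10k people from a death count and a population, following the definition of mrall in the table. The helper names and the toy rows are made up for illustration; they are not values from the actual data set.

```python
from collections import Counter

def is_balanced(rows, n_entities, n_periods):
    """Return True if every entity appears exactly once in every period."""
    keys = [(r["state"], r["year"]) for r in rows]
    if len(keys) != len(set(keys)):
        return False  # a (state, year) pair appears more than once
    per_state = Counter(state for state, _ in keys)
    years = {year for _, year in keys}
    return (
        len(per_state) == n_entities
        and len(years) == n_periods
        and all(count == n_periods for count in per_state.values())
    )

def rate_per_10k(deaths, population):
    """Deaths per 10,000 residents, as in the definition of mrall."""
    return 10_000 * deaths / population

# Toy panel: 2 hypothetical states x 3 years (numbers are invented).
rows = [
    {"state": 1, "year": 1982, "allmort": 800, "pop": 4_000_000},
    {"state": 1, "year": 1983, "allmort": 780, "pop": 4_000_000},
    {"state": 1, "year": 1984, "allmort": 820, "pop": 4_100_000},
    {"state": 2, "year": 1982, "allmort": 300, "pop": 1_000_000},
    {"state": 2, "year": 1983, "allmort": 310, "pop": 1_000_000},
    {"state": 2, "year": 1984, "allmort": 290, "pop": 1_000_000},
]

print(is_balanced(rows, n_entities=2, n_periods=3))
print(rate_per_10k(rows[0]["allmort"], rows[0]["pop"]))
```

For the real data, the same check would use n_entities=48 and n_periods=7, which implies 48 x 7 = 336 rows if the panel is balanced.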

Approaching the data

If you’ve scrolled down this far, you know that there are quite a few variables to consider here, so how do we make sense of all of it? Do we try to clean the data? Do we look for outliers? Do we run summaries on each variable to get a better understanding of their mean values, standard deviations, and distributions? No. Since we were asked to run a linear regression, we need to start by organizing this list and understanding what kinds of variables we have. Unless my professor was being incredibly nice and the data was cooked, we won’t be including EVERY variable in our final model.

So how do we define variables? Categorical vs Binary vs Ordinal vs etc? Think simpler: I just wanted to separate our variables into potential dependent variables and potential independent variables for our linear regression. I noticed with many of my classmates that, when running linear regression, it’s very easy to fall into the trap of correlation = causation. This is because the mindset of linear regression (or even just regular mathematical functions) takes for granted that all of our Xs are independent and our Y is the dependent variable. It’s funny but sad to see a linear regression with strong p-values and a good adjusted R-squared (or any other measure among the endless list that people use to assess their models) where the person building the model botched their independent and dependent variables. At the end of the day, the model needs to be built on the analyst’s hypothesis of how the world works (although that hypothesis can very often be wrong; ever heard of the Cobra effect?).

So I sorted the variables into dependent and independent variables. Well actually, it’d be better to say potential Y variables and potential X variables for the model, because some of the “independent variables” would really just be explanatory variables, since they’re measures that we cannot control (for example, we can control taxes on beer or the minimum legal drinking age, but we cannot control what year it is or what percentage of the state identifies as Mormon). It was pretty simple, given the goal of the project: all of the variables that measured vehicle fatality rates were potential dependent variables. Among the rest of the variables, only 5 were truly independent variables: tax on beer, minimum legal drinking age, mandatory jail sentence time, mandatory community service time, and (technically, but arguably an explanatory variable) percentage of the state residing in dry counties.
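
A rough sketch of that sorting in Python follows. It leans on an observation about the table rather than anything we were told: every fatality rate column happens to start with "mr". The role labels and the sort_roles helper are hypothetical names, and anything that is neither a rate nor one of the five policy variables is simply left in an "other" bucket.

```python
# Column names copied from the variable table above.
COLUMNS = [
    "state", "year", "spircons", "unrate", "perinc", "beertax", "sobapt",
    "mormon", "mlda", "dry", "yngdrv", "vmiles", "jaild", "comserd",
    "allmort", "mrall", "allnite", "mralln", "allsvn", "a1517", "mra1517",
    "a1517n", "mra1517n", "a1820", "a1820n", "mra1820", "mra1820n", "a2124",
    "mra2124", "a2124n", "mra2124n", "aidall", "mraidall", "pop", "pop1517",
    "pop1820", "pop2124", "miles", "gspch",
]

# The five variables treated as truly independent (policy levers).
POLICY = {"beertax", "mlda", "jaild", "comserd", "dry"}

def sort_roles(columns):
    """Split columns into potential Y, policy X, and everything else."""
    roles = {"potential_y": [], "policy_x": [], "other": []}
    for name in columns:
        if name.startswith("mr"):  # assumption: rate columns share this prefix
            roles["potential_y"].append(name)
        elif name in POLICY:
            roles["policy_x"].append(name)
        else:
            roles["other"].append(name)
    return roles

roles = sort_roles(COLUMNS)
for role, names in roles.items():
    print(role, len(names), names)
```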

Cleaning the data

After running summary statistics for all of the variables, we found that two of them (mandatory jail sentence time and mandatory community service time) were missing two observations, both in the state of California, both in the year 1988. After a quick Google, it was clear that California in 1988 was quite an ambiguous time, so we’ll cut them some slack on this. Since we’re dealing with panel data with many entities over many years (a kind of data structure that is quite robust, with the right analysis techniques), it shouldn’t be a big issue.
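
The missing-value check can be sketched in a few lines of Python. The find_missing helper and the toy rows are hypothetical: the rows only mimic the pattern described above (state 6 is California's FIPS code), the 0/1 coding of the two variables is an assumption, and None stands in for a missing value.

```python
def find_missing(rows, variables):
    """List (state, year, variable) for every missing value."""
    return [
        (row["state"], row["year"], var)
        for row in rows
        for var in variables
        if row.get(var) is None
    ]

# Toy rows, not the real data: 0/1 flags are an assumed coding.
rows = [
    {"state": 6, "year": 1987, "jaild": 0, "comserd": 0},
    {"state": 6, "year": 1988, "jaild": None, "comserd": None},
    {"state": 48, "year": 1988, "jaild": 1, "comserd": 0},
]

missing = find_missing(rows, ["jaild", "comserd"])
print(missing)  # [(6, 1988, 'jaild'), (6, 1988, 'comserd')]
```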

We also made scatterplots within states to view the dependent variables over time, just to make sure most states weren’t behaving erratically year over year, which would indicate either a compromised data set or some very interesting analysis for later.
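
We did this check by eye, but a numeric version of the same idea can be sketched in Python: compute the largest year-over-year relative change in a fatality rate for each state and flag the states that exceed a cutoff. The 25% threshold, the helper names, and the toy numbers are all assumptions for illustration, and the sketch assumes rates are strictly positive.

```python
def max_yoy_change(series):
    """Largest absolute relative change between consecutive years.

    series is a list of (year, value) pairs with positive values.
    """
    ordered = sorted(series)
    changes = [
        abs(cur - prev) / prev
        for (_, prev), (_, cur) in zip(ordered, ordered[1:])
    ]
    return max(changes, default=0.0)

def flag_erratic(panel, threshold=0.25):
    """Return the states whose largest yearly jump exceeds the threshold."""
    return sorted(
        state for state, series in panel.items()
        if max_yoy_change(series) > threshold
    )

# Toy fatality rates per state (invented numbers).
panel = {
    1: [(1982, 2.0), (1983, 2.1), (1984, 2.0)],  # steady
    2: [(1982, 2.0), (1983, 3.0), (1984, 1.5)],  # jumps by 50% each year
}

print(flag_erratic(panel))  # [2]
```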

Looking at the data

Next, we compared independent variables across time within states to get a good understanding of which laws changed, when, and in which states. This is important because, at the end of the day, we want to capture the effect of these independent variables on vehicular death rates. If the linear regression sees little to no change in these independent variables, there is no variation to estimate an effect from, and the model’s going to be garbage.
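
One way to sketch this comparison in Python is to list, for each state, the years in which a policy variable differs from the previous year, and to note which states never change at all. The change_years helper and the minimum legal drinking age values below are made up for illustration.

```python
def change_years(series):
    """Years in which the value differs from the previous year's value."""
    ordered = sorted(series)
    return [
        year
        for (_, prev), (year, cur) in zip(ordered, ordered[1:])
        if cur != prev
    ]

# Toy mlda histories for two hypothetical states.
mlda = {
    1: [(1982, 18), (1983, 18), (1984, 19), (1985, 21)],
    2: [(1982, 21), (1983, 21), (1984, 21), (1985, 21)],
}

changes = {state: change_years(series) for state, series in mlda.items()}
no_variation = sorted(state for state, years in changes.items() if not years)

print(changes)       # {1: [1984, 1985], 2: []}
print(no_variation)  # [2]
```

States that land in no_variation contribute nothing to a within-state estimate of that policy's effect, which is the concern raised above.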

We also compared explanatory variables across states within the same time period, to see which states differed demographically. It’d be foolish to assume that a minimum legal drinking age affects all states in the same way when their demographics vary widely: if a state had very few people young enough to be affected by the minimum legal drinking age, shouldn’t we expect that state to be less affected by a change in it? We were looking for the following segments:

States with young demographics
- We wanted to take note of which states had younger demographics because, as noted above, they’d probably be more affected by a change in the minimum legal drinking age
States that had high vehicle mileage rates
- States that have low vehicle mileage rates (perhaps because of high use of public transportation) may show less of an effect from changes to our independent variables, since they may have lower vehicle fatality rates in general (not necessarily true, but a consideration)
- It could also be that states that drive more have more experienced drivers overall, which could lead them to drive more responsibly, more irresponsibly, or have no effect at all on the vehicle fatality rate
States that had higher alcohol consumption
- Given that all of our independent variables relate to drunk driving laws, if a state has low alcohol consumption in general, it would make sense that changes to drunk driving laws would have less of an effect on its vehicle fatality rates

We also wanted to note which states were richer vs poorer, which states were more religious, which states just had a higher overall population, etc. It should be easy to gather what distinctions we tried to spot, given that they’re all based on variables we defined as explanatory variables.
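
A simple way to form segments like these is a median split on one explanatory variable within a single year. The Python sketch below is only one possible rule (we did not necessarily use a median cutoff), and the split_at_median helper and the share-of-young-drivers values are invented for illustration.

```python
from statistics import median

def split_at_median(values_by_state):
    """Split states into those above the median and those at or below it."""
    cutoff = median(values_by_state.values())
    above = sorted(s for s, v in values_by_state.items() if v > cutoff)
    at_or_below = sorted(s for s, v in values_by_state.items() if v <= cutoff)
    return above, at_or_below

# Toy yngdrv values for one year, stored as fractions (assumed format).
yngdrv_one_year = {1: 0.15, 2: 0.19, 3: 0.22, 4: 0.17}

younger, older = split_at_median(yngdrv_one_year)
print(younger)  # [2, 3]
print(older)    # [1, 4]
```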

Modeling

After getting a better understanding of our data set, we decided to split our modeling efforts three ways – one for each group member. The three theoretical models tackled these three questions:

1. What is the effect of a change in the minimum legal drinking age on vehicle fatality rates?
2. What is the effect of a change in beer tax rates on vehicle fatality rates?
3. What is the effect of a mandatory jail or community service consequence on vehicle fatality rates?
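
For a feel of what estimating one of these effects can look like with panel data, here is a minimal Python sketch of a state fixed-effects ("within") estimator with a single regressor. This is an assumption about technique, not a description of the models we actually built: it subtracts each state's own mean from x and y and then fits a slope through the demeaned data, so only within-state changes over time drive the estimate. Year effects and control variables are left out, and the within_slope helper and toy numbers are made up.

```python
from collections import defaultdict

def within_slope(rows, y, x):
    """Slope of y on x after removing each state's mean (entity fixed effects)."""
    by_state = defaultdict(list)
    for row in rows:
        by_state[row["state"]].append((row[x], row[y]))

    num = 0.0
    den = 0.0
    for obs in by_state.values():
        x_bar = sum(xv for xv, _ in obs) / len(obs)
        y_bar = sum(yv for _, yv in obs) / len(obs)
        for xv, yv in obs:
            num += (xv - x_bar) * (yv - y_bar)
            den += (xv - x_bar) ** 2

    if den == 0:
        raise ValueError("no within-state variation in x; slope is undefined")
    return num / den

# Toy data built so that mrall = state_level - 0.5 * beertax exactly.
rows = [
    {"state": 1, "beertax": 0.5, "mrall": 2.75},
    {"state": 1, "beertax": 1.0, "mrall": 2.50},
    {"state": 1, "beertax": 1.5, "mrall": 2.25},
    {"state": 2, "beertax": 0.0, "mrall": 2.00},
    {"state": 2, "beertax": 0.5, "mrall": 1.75},
    {"state": 2, "beertax": 1.0, "mrall": 1.50},
]

slope = within_slope(rows, y="mrall", x="beertax")
print(slope)  # -0.5
```

The ValueError branch is the code version of the earlier point about variation: if a policy never changes within any state, there is nothing for this kind of estimator to work with.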

Before I post the final report write-up for the analysis, I wanted to take a moment to talk about this project as an educational tool. What I appreciated about it was that, while it did not necessarily represent the way we would deal with analytics in real life (i.e. well-defined goal, constraint on model type, relatively clean data set, data definitions, etc), it gave us an opportunity to see WHY people are paid so much to do analytical work: when working on analyses and conclusions with my group partners, we all arrived at different results and different conclusions. Sometimes, it was because of a mistake someone made; perhaps their theoretical model structure was incorrect, and they were including or excluding variables they shouldn’t have. Sometimes, it just ended up being a question of whose approach had numbers that looked better for our professor. Analyzing this data set wasn’t simply throwing a bunch of variables into a computer, asking it to run a regression, and then spouting numbers onto a report. There were many steps along the way where you could either mess up or diverge onto two different, but still relatively correct, paths.

Econometric Analysis of Drunk Driving Laws