Econometric Analysis of Drunk Driving Laws


At the end of my econometrics and time series analysis course, we were assigned a project: analyze a data set to understand the effect of drunk driving laws on traffic deaths. Honestly, this was the most eye-opening and educational part of the course, mainly because much of the material presented in class was content I had already been exposed to as an undergraduate at Rice (although the review was very welcome, since before this course much of my foundational econometrics knowledge was powered by recognition and not recall).
Data Set
The data given to us was structured as panel data, that is, data for several unique entities over a period of time. In this case, each entity was a state (for a total of 48 states, since the data excluded Alaska and Hawaii), and the period of time was 1982 through 1988, at annual intervals. For each entity and each time period, we were given the following variables, complete with their data definitions:
Variable | Description |
state | State ID (FIPS) Code |
year | Year |
spircons | Per Capita Pure Alcohol Consumption (Annual, Gallons) |
unrate | State Unemployment Rate (%) |
perinc | Per Capita Personal Income ($) |
beertax | Tax on Case of Beer ($) |
sobapt | % Southern Baptist |
mormon | % Mormon |
mlda | Minimum Legal Drinking Age (years) |
dry | % Residing in Dry Counties. A dry county is a county whose government forbids the sale of any kind of alcoholic beverages. Some prohibit off-premises sale, some prohibit on-premises sale, and some prohibit both. |
yngdrv | % of Drivers Aged 15-24 |
vmiles | Avg. Miles per Driver |
jaild | Mandatory Jail Sentence |
comserd | Mandatory Community Service |
allmort | # of Vehicle Fatalities (#VF) |
mrall | Vehicle Fatality Rate (VFR) – # deaths in given state in given year per 10k ppl living in that state that year |
allnite | # of Night-time VF (#NVF) |
mralln | Night-time VFR (NVFR) |
allsvn | # of Single VF (#SVN) |
a1517 | #VF, 15-17 year olds |
mra1517 | VFR, 15-17 year olds |
a1517n | #NVF, 15-17 year olds |
mra1517n | NVFR, 15-17 year olds |
a1820 | #VF, 18-20 year olds |
a1820n | #NVF, 18-20 year olds |
mra1820 | VFR, 18-20 year olds |
mra1820n | NVFR, 18-20 year olds |
a2124 | #VF, 21-24 year olds |
mra2124 | VFR, 21-24 year olds |
a2124n | #NVF, 21-24 year olds |
mra2124n | NVFR, 21-24 year olds |
aidall | # of alcohol-involved VF |
mraidall | Alcohol-Involved VFR |
pop | Population |
pop1517 | Population, 15-17 year olds |
pop1820 | Population, 18-20 year olds |
pop2124 | Population, 21-24 year olds |
miles | total vehicle miles (millions) |
gspch | GSP Rate of Change. This is a measure of economic growth. |
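To make the "panel" idea concrete, here is a minimal sketch, assuming pandas and a tiny made-up excerpt (the numbers below are invented for illustration and are not the course data). It indexes the rows by state and year, then checks that no (state, year) pair is duplicated and that every state has the same number of years.

```python
import pandas as pd

# Made-up excerpt: two states, three years each. NOT the real course data.
df = pd.DataFrame({
    "state":   [1, 1, 1, 4, 4, 4],            # FIPS-style state IDs
    "year":    [1982, 1983, 1984] * 2,
    "beertax": [1.5, 1.5, 1.6, 0.2, 0.2, 0.3],
    "mrall":   [2.1, 2.3, 2.2, 2.5, 2.3, 2.9],
})

# Index by entity and time so each (state, year) pair is one observation.
panel = df.set_index(["state", "year"]).sort_index()

# A well-formed panel has no repeated (state, year) pairs.
has_duplicates = panel.index.duplicated().any()

# A balanced panel has the same number of years for every state.
obs_per_state = panel.groupby(level="state").size()
is_balanced = obs_per_state.nunique() == 1

print(obs_per_state)
print("duplicates:", has_duplicates, "| balanced:", is_balanced)
```

On the full data set, a balanced panel of 48 states over the 7 years from 1982 through 1988 would have 336 rows.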
Approaching the data
If you’ve scrolled down this far, you know that there are quite a few variables to consider here, so how do we make sense of all of it? Do we try to clean the data? Do we look for outliers? Do we run summaries on each variable to get a better understanding of their mean values, standard deviations, and distributions? No. Since we were asked to run a linear regression, we need to start by organizing this list and understanding what kinds of variables we have. Unless my professor was being incredibly nice and the data was cooked, we won’t be including EVERY variable in our final model.
So how do we define variables? Categorical vs Binary vs Ordinal vs etc? Think simpler – I just wanted to separate our variables into potential dependent variables and potential independent variables for our linear regression. I noticed with many of my classmates that, when running linear regression, it’s very easy to fall into the trap of correlation = causation. This is because linear regression (or even just regular mathematical functions) puts us in the mindset that all of our Xs are independent and our Y variable is the dependent variable, whether or not the world actually works that way. It’s funny but sad to see a linear regression run with strong p-values and a good adjusted r-squared value (and any other measure among the endless list of measures that people use to assess their models), but where the person building the model mixed up their independent and dependent variables. At the end of the day, the model needs to be built on the analyst’s hypothesis of how the world works (although this can be wrong very often – ever heard of the Cobra effect?).
So I sorted the variables into dependent and independent variables. Actually, it’d be better to say potential Y variables and potential X variables for the model, because some of the “independent variables” would really just be explanatory variables, since they’re measures that we cannot control (for example, we can control taxes on beer or the minimum legal drinking age, but we cannot control what year it is or what percentage of the state identifies as Mormon). It was pretty simple, given the goal of the project: all of the variables that measured vehicle fatality rates were potential dependent variables. Among the rest of the variables, only 5 were truly independent variables: tax on beer, minimum legal drinking age, mandatory jail sentence time, mandatory community service time, and (technically, but arguably an explanatory variable) percentage of the state residing in dry counties.
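The sorting described above can be sketched in a few lines of plain Python. The column names come from the variable table; the `mr` prefix rule and the catch-all `other_vars` bucket are shortcuts assumed here for illustration, not something taken from the project itself.

```python
# All column names from the variable table.
columns = [
    "state", "year", "spircons", "unrate", "perinc", "beertax", "sobapt",
    "mormon", "mlda", "dry", "yngdrv", "vmiles", "jaild", "comserd",
    "allmort", "mrall", "allnite", "mralln", "allsvn", "a1517", "mra1517",
    "a1517n", "mra1517n", "a1820", "a1820n", "mra1820", "mra1820n",
    "a2124", "mra2124", "a2124n", "mra2124n", "aidall", "mraidall",
    "pop", "pop1517", "pop1820", "pop2124", "miles", "gspch",
]

# Potential Y variables: every fatality *rate* column. In this table those
# are exactly the names that start with "mr" (an assumed shortcut).
outcome_candidates = [c for c in columns if c.startswith("mr")]

# Potential X variables that a legislature can actually set.
policy_vars = ["beertax", "mlda", "jaild", "comserd", "dry"]

# Everything else: identifiers, raw counts, and explanatory variables.
other_vars = [
    c for c in columns
    if c not in outcome_candidates and c not in policy_vars
]

print(outcome_candidates)
print(policy_vars)
print(len(other_vars), "remaining columns")
```

Writing the split down explicitly like this makes it harder to accidentally put a fatality rate on the right-hand side of a regression later.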
Cleaning the data
After running summary statistics for all of the variables, we found that two of them (mandatory jail sentence time and mandatory community service time) had missing values: two in total, both for the state of California, both in the year 1988. After a quick Google, it was clear that California in 1988 was quite an ambiguous time, so we’ll cut them some slack on this. Since we’re dealing with panel data with many entities over many years (a kind of data structure that is quite robust, with the right analysis techniques), it shouldn’t be a big issue.
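A check like this one takes only a couple of lines. The sketch below assumes pandas and a made-up four-row frame in which `jaild` and `comserd` are coded 0/1 (the coding and the non-missing values are assumptions for illustration); only the position of the gap, California (FIPS code 6) in 1988, mirrors what we found.

```python
import pandas as pd

# Made-up rows; only the location of the missing values mirrors the text.
df = pd.DataFrame({
    "state":   [6, 6, 48, 48],            # 6 = California, 48 = Texas (FIPS)
    "year":    [1987, 1988, 1987, 1988],
    "jaild":   [0, None, 0, 0],           # assumed 0/1 coding
    "comserd": [0, None, 0, 0],           # assumed 0/1 coding
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# List the (state, year) rows where either law variable is missing.
law_vars = ["jaild", "comserd"]
missing_rows = df.loc[df[law_vars].isna().any(axis=1), ["state", "year"]]

print(missing_per_column)
print(missing_rows)
```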
We also made scatterplots within states, to view the dependent variables over time, just to make sure most states weren’t behaving erratically year over year, which would indicate either a compromised data set or some very interesting analysis for later.
Looking at the data
Next, we compared independent variables across time within states to get a good understanding of when which laws changed for which states. This is important, because at the end of the day, we want to capture the effect of these independent variables on vehicular death rates. If the linear regression sees little to no change in these independent variables, the model’s going to be garbage.
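One way to check for this kind of within-state variation is to count how many distinct values each policy variable takes inside each state. This is a minimal sketch, assuming pandas and made-up numbers; it is not the exact comparison we ran.

```python
import pandas as pd

# Made-up data: state 1 raises its drinking age, state 2 raises its beer tax.
df = pd.DataFrame({
    "state":   [1, 1, 1, 2, 2, 2],
    "year":    [1982, 1983, 1984] * 2,
    "mlda":    [19, 19, 21, 21, 21, 21],
    "beertax": [1.5, 1.5, 1.5, 0.2, 0.2, 0.3],
})

policy_vars = ["mlda", "beertax"]

# Number of distinct values each policy variable takes within each state.
n_values = df.groupby("state")[policy_vars].nunique()

# A variable "changed" in a state if it took more than one value there.
changed = n_values > 1

# Share of states in which each variable changed at least once.
share_changed = changed.mean()

print(changed)
print(share_changed)
```

A variable whose share is close to zero gives the regression almost nothing to learn from within states.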
We also compared explanatory variables across states within the same time period, to see which states differed demographically. It’d be foolish to assume that a minimum legal drinking age affects all states in the same way when their demographics vary widely. If a state had very few people young enough to be affected by the minimum legal drinking age, shouldn’t we expect that state to be less affected by a change in it? We were looking for the following segments:
States with young demographics
- We wanted to take note of which states had younger demographics because of what I said above – they’d probably be more affected by a change in the minimum legal drinking age
States that had high vehicle mileage rates
- States that have low vehicle mileage rates (perhaps because of high participation in public transportation) may show less of an effect from changes to our independent variables since they may have lower rates of vehicle fatalities in general (not necessarily true, but a consideration)
- It could also be that states that drive more have people that are overall more experienced drivers, which could either lead them to drive more responsibly or irresponsibly, or not have an effect at all on the vehicle fatality rate
States that had higher consumption of alcohol
- Given that all of our independent variables relate to drunk driving laws, if a state has low alcohol consumption in general, it would make sense that changes to drunk driving laws would have less of an effect on their vehicle fatality rates
We also wanted to note which states were richer vs poorer, which states were more religious, which states just had a higher overall population, etc. It should be easy to gather what distinctions we tried to spot, given that they’re all based on variables we defined as explanatory variables.
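A simple way to spot these segments is to take one year of data and flag the states that sit above the cross-state median on each explanatory variable. The sketch below assumes pandas, made-up values, and a median split; the median cutoff is an assumption for illustration, not the rule we used.

```python
import pandas as pd

# Made-up 1988 snapshot for four hypothetical states.
df = pd.DataFrame({
    "state":    [1, 2, 3, 4],
    "year":     [1988] * 4,
    "yngdrv":   [15.0, 20.0, 18.0, 22.0],            # % of drivers aged 15-24
    "vmiles":   [7000.0, 9000.0, 8500.0, 6500.0],    # miles per driver
    "spircons": [1.2, 2.4, 1.9, 1.5],                # gallons per capita
})

segment_vars = ["yngdrv", "vmiles", "spircons"]

# Compare states within a single year so time trends don't blur the picture.
snapshot = df.loc[df["year"] == 1988].set_index("state")[segment_vars]

# True where a state is above the cross-state median for that variable.
above_median = snapshot > snapshot.median()

print(above_median)
```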
Modeling
After getting a better understanding of our data set, we decided to split our modeling efforts three ways – one for each group member. The three theoretical models tackled these three questions:
- What is the effect of a change in the minimum legal drinking age on vehicle fatality rates?
- What is the effect of a change in beer tax rates on vehicle fatality rates?
- What is the effect of a mandatory jail or community service consequence on vehicle fatality rates?
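To illustrate why the panel structure matters for questions like these, here is a minimal sketch of a state fixed effects ("within") estimate next to a pooled one. Everything in it is an assumption made for illustration: the data are made up, the true beer tax effect is set to -0.5 by construction, and this is not necessarily the specification any of our three models used.

```python
import pandas as pd

# Made-up data: state 1 has a high beer tax AND a high baseline fatality rate.
df = pd.DataFrame({
    "state":   [1, 1, 1, 2, 2, 2],
    "year":    [1982, 1983, 1984] * 2,
    "beertax": [1.0, 1.2, 1.4, 0.2, 0.2, 0.5],
})

# Build the outcome so the true within-state effect of beertax is -0.5.
state_effect = df["state"].map({1: 3.0, 2: 1.5})
df["mrall"] = state_effect - 0.5 * df["beertax"]

# Within transformation: subtract each state's own mean from its rows,
# which removes anything constant within a state (the state effect).
cols = ["mrall", "beertax"]
within = df[cols] - df.groupby("state")[cols].transform("mean")

# One-regressor OLS slope on the demeaned data.
beta_within = (
    (within["beertax"] * within["mrall"]).sum()
    / (within["beertax"] ** 2).sum()
)

# Pooled OLS slope for contrast: ignores which state a row belongs to.
x = df["beertax"] - df["beertax"].mean()
y = df["mrall"] - df["mrall"].mean()
beta_pooled = (x * y).sum() / (x ** 2).sum()

print("within-state slope:", beta_within)
print("pooled slope:", beta_pooled)
```

In this toy example the pooled slope comes out positive even though the tax lowers fatalities within every state, because the high-tax state also happens to have the higher baseline. That is the correlation = causation trap from earlier, in regression form.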
Before I post the final report write-up for the analysis, I wanted to take a moment to talk about this project as an educational tool. It did not necessarily represent the way we would deal with analytics in real life (we had a well-defined goal, a constraint on model type, a relatively clean data set, data definitions, etc.), but what I appreciated was that it gave us an opportunity to see WHY people are paid so much to do analytical work. When working on analyses and conclusions with my group partners, we all arrived at different results and different conclusions. Sometimes, it was because of a mistake someone made: perhaps their theoretical model structure was incorrect, and they were including or excluding variables they shouldn’t have. Sometimes, it just ended up being a question of whose approach had numbers that looked better for our professor. Analyzing this data set wasn’t simply throwing a bunch of variables into a computer, asking it to run a regression, and then spouting numbers onto a report. There were many steps along the way where you could either mess up or diverge into two different, but still relatively correct, paths.
Written by Edward Tian
