Confidence intervals for random forest partial plots

Everyone loves random forests (RFs).  The algorithm is powerful, intuitive, and easily implemented in most languages and statistical platforms.  However, a common critique is that it is difficult to extrapolate substantive effects from a trained random forest.  This is especially true amongst the academic crowd who are used to coefficients and magical ***’s.  Alas, R’s randomForest package and Python’s sklearn both provide “partial plot” methods, which demonstrate the substantive effects on an independent variable (IV) on the dependent variable (DV).  But, there are two key issues with the resulting plots. First, they are ugly.  Second, they do not provide confidence intervals (CI’s), which can make interpretation difficult relative to GLM-based marginal effects plots.

Taken directly from the randomForest partialPlot documentation: “The function being plotted is defined as (the function below) where x is the variable for which partial dependence is sought and x_{1,C} is the other variables in the data.”

\widetilde{f}=\frac{1}{n}\sum\limits^{n}_{n=1} f(x, x_{1,C})

Below is a crude representation for how the partialPlot function works.  Consider the data below (Y=dependent variable, var_1, var_2, and var_3 are independent variables):

y var_1 var_2 var_3
12 4 7 1
18 6 1 4
20 18 4 2
14 16 4 7

Step 1: Train a random forest on the data.

Step 2: Choose an independent variable of interest.  For this example, let’s choose var_1.

Step 3: Build a list out of unique values in var_1.  For this example, list = [4,6,16,18].

Step 4: Start with the first element in the list (4, in this case), and set all values in the column chosen in Step 2 to that element.  Then, use the trained algorithm object from Step 1, and build a prediction for each row.  The table below illustrates the output of Step 4.  Note that all values of var_1 are set to the first element in the list for the first iteration.

prediction var_1 var_2 var_3
14.2 4 7 1
22.4 4 1 4
14.5 4 4 2
17.6 4 4 7

Persist the element value (4, in this case) and the mean of the predictions.

Step 6: Iterate through all elements in the list, and then plot.  In the plot, the list values are x coordinates, the mean of the predictions form the y coordinates.

Although this approach can uncover basic trends in the marginal effects of an independent variable of interest, it prevents a user from determining the variance of the suggested effects.  This can be problematic for all of the reasons that ignoring variance around predictions can be problematic.

The R code at the bottom of the page provides a function called plot_partial, which adds confidence interval functionality to randomForest partialPlots.  Essentially, it just removes the \frac{1}{n}\sum\limits^{n}_{n=1} portion of the function, and leverages the full vector of predictions.

First, let’s examine the output of the partialPlot function, first with a continuos IV, then with a categorical IV.

>airquality <- na.omit(airquality)
>rf_1 <- randomForest(Ozone ~ ., airquality)
>partialPlot(rf_1, airquality, Temp)

>rf_iris<-randomForest(Sepal.Length ~., iris)
>partialPlot(rf_iris, iris, Species)

No need to belabor the point, but these plots are ugly and ignore prediction variance. Here are the same partial plots, only this time using the partial_plot function provided in the code at the bottom of the page.

> plot_partial(rf=rf_1, data=airquality, dv=”Ozone”, iv=”Temp”, conf_int_lb=.25, conf_int_ub=.75)

>plot_partial(rf_iris, iris, dv=”Sepal.Length”, iv=”Species”, conf_int_lb=.27, conf_int_ub=.75)

Arguments for the partial_plot function:

rf = a trained random forest object
data = a data frame in the same feature space as the data frame used to train the random forest object. Often, this will be the same data set used in training.
dv = the name of the dependent variable, provide in ” ” .
iv = the name of the independent variable of interest, provided in ” ” .
conf_int_lb = the percentile of predictions to plot as the lower bound.
conf_int_ub = the percentile of predictions to plot as the upper bound.
num_sample = equivalent of If iv is continuous, the number of values to choose for evaluating partial dependence. Default = NULL.
delta = whether to plot the change in prediction between the actual iv values and the simulated values. Default = NULL.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
rf_predict<-function(rf_object, data){
if (rf_object$type=="classification"){
p <-predict(rf_object, data, type="prob")
} else {
p <-predict(rf_object, data)
return (p)
function(rf, X, dv, iv, conf_int_lb=.25,
conf_int_ub=.75, range_low=NULL,
range_high=NULL, delta=FALSE, num_sample=NULL)
if (is.factor(X[, iv_name])==TRUE){
factor_var<-unique(iris[, iv_name])
#the test set needs all factor levels. so, we build them and will drop them before we plot
factor_names <- attributes(factor_var)$levels
fix_factor_df[, iv_name]<-factor_names
y_hat_df <- data.frame(matrix(vector(),0, 2))
y_temp <- data.frame(matrix(vector(), nrow(X), 2))
y<-predict(rf, X)
for (i in 1:length(factor_names)){
X[, iv_name] <- factor_names[i]
X[, iv_name] <- factor(X[, iv_name])
X_temp<-rbind(X, fix_factor_df)
p<-rf_predict(rf, X_temp)
y_temp[,1]<-p[1:nrow(X)] #drop the fix_factor_df rows
if (delta==TRUE){
y_hat_df<-rbind(y_hat_df, y_temp)
plot<- qplot(y_hat_df[,2], y_hat_df[,1],
data = y_hat_df,
main = paste("Partial Dependence of", (iv_name), "on", (dv_name))) +
ylab(bquote("Predicted values of" ~ .(dv_name))) +
return (plot)
} else {
conf_int <-(conf_int_ub-conf_int_lb)*100
temp<-sort(X[, iv_name])
if (is.null(num_sample)==FALSE){
temp<-sample(temp, num_sample)
if (is.null(range_low)==FALSE & is.null(range_high)==FALSE){
low_value<-quantile(temp, range_low)
high_value<-quantile(temp, range_high)
temp<-temp[temp<high_value & temp>low_value]
y<-rf_predict(rf, X)
for (i in 1:length(temp)){
X[, iv_name] <- temp[i]
y_hat<-rf_predict(rf, X)
if (delta==TRUE){
y_hat_lb[i]<-quantile(y_hat, conf_int_lb)
y_hat_ub[i]<-quantile(y_hat, conf_int_ub)
df_new<, y_hat_mean, y_hat_lb, y_hat_ub))
plot<- ggplot(df_new, aes(temp)) +
geom_line(aes(y=y_hat_mean), colour="blue") +
geom_ribbon(aes(ymin=y_hat_lb, ymax=y_hat_ub), alpha=0.2) +
geom_rug(aes()) +
xlab(iv_name) +
ylab(bquote("Predicted values of" ~ .(dv_name))) +
ggtitle(paste("Partial Dependence of", (iv_name), "on", (dv_name), "\n with", (conf_int), "% Confidence Intervals"))
return (plot)
view raw gistfile1.r hosted with ❤ by GitHub

Data and code to build district-month predictions of future violence in Afghanistan

Earlier this week, I posted an article about building predictions of future levels of violence at the district-month level for Afghanistan.  Here is there data and the code is below.  Note that it takes about 30 minutes (not 5) to run.  Also note that it generates LOTS of errors since many of the time-series have long runs of 0′s, but eventually these get predicted to be 0′s, so it’s all good.  Let me know if you would like the province- or country-level data or code.  Efficiency wasn’t really my goal with this code (it’s crude and clunky but achieves good predictive accuracy), the loop structure is slow but should be pretty easy to follow.  Please let me know if anything doesn’t make sense.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
##programmer: JEY
##District-month forecasts
##note: this takes about 5 minutes to run. It throws lots of errors due to many all zero time series, but it will eventually complete and build predictions
for (k in start:end){
error<-matrix(data=0, nrow=317, ncol=3)
colnames(error)<-c("true", "naive", "arfima")
for (i in 1:317){
test<-subset(data, newid==i &monthly<month)
unif<-runif(length(a), min=0, max=.1)
a<-a+unif #this minor addition allows the arfima to converge
y_hat<-forecast(fit, h=1)
true<-subset(data, newid==i &monthly==month) #actual leve of violence
naive<-subset(data, newid==i &monthly==(month-1)) #naive predict of t-1
error<-round(error, digits = 0)
compare<-cbind(error.naive, error.arfima)
results<-cbind(compare, dif)
colnames(results)<-c("error.naive", "error.arfima", "difference")
write.csv(results, "results_district.csv")
view raw gistfile1.r hosted with ❤ by GitHub

The Effects of Intra-state Conflict on Interstate Conflict: An Analysis of GDELT

The release of the GDELT dataset has been receiving a lot of attention, and rightfully so (see Foreign Policy’s write up here, Jay Ulfelder’s write up here, and Phil Schrodt and Kalev Leetaru’s official release paper here).  Last week, I posted a chapter of my dissertation that used GDELT to build predictions of violence in Afghanistan (see the posts below).  Below is a chapter that I wrote that uses GDELT to provide a rigorous, dyad-month level analysis of the effects of domestic conflict on interstate conflict.

The Effects of Domestic Conflict on Interstate Conflict: An Event Data Analysis of Monthly Level Onset and Intensity 

A Tale of Two Ph.D.s

On Friday, Slate published an article called “Thesis Hatement: Getting a Literature Ph.D. will turn you into an emotional trainwreck, not a professor”, written by Rebecca Shuman.  Since I don’t know Shuman or anything about a humanities Ph.D. program, I’ll withhold judgments that I would otherwise be inclined to make.  I just figured I’d share a bit about my experience getting a Ph.D., since it is pretty much the exact opposite of Shuman’s.

I began the Ph.D. program in the department of political science at Penn State in the fall of 2009 with no prior graduate schooling.  Penn State, like most major universities, covers all tuition for students accepted into the Ph.D. program.  It also provides a stipend that covers the basic cost of living. Early on in my first semester, I told the faculty that my intention was to finish the Ph.D. and then work for either the government or enter the private sector.  Reactions ranged from neutral to highly supportive.

Last month, I completed my Ph.D.,  a little over 3 and a half years since I started.  During that time, I took nearly a dozen methodological courses, four of which were outside the department but still fully funded.  I taught an undergraduate class, did outside consulting work for the government, presented at conferences, and got a few publications.  I also wasted a hell of a lot of time, and not in Shuman’s “grad school is a waste of time” sense, but in the “going to bars to play pool and drink $5 pitchers of Bud Light” sense, meaning that I certainly could have been more productive.

The training I received as a Ph.D. student qualified me for a host of jobs, ranging from think tanks to government to academia to the private sector.  In January, I accepted a job as a data analyst with Allstate Insurance and started in March.

I must acknowledge that I was lucky, since my department gave me (and the other students in the program) considerable freedom and support to pursue whatever interests.  I also had an incredibly good advisor.  I have no idea if my experience is indicative of other social science Ph.D. program and am not claiming that it is.  All I know is that for me, the process was rewarding, fairly fast, and led directly to a good job.

Ok, so I lied –  I will pass judgment on one point Shuman makes.  She claims that a humanities BA is among the most hireable, which is just flat out wrong (note that the supporting article was written in 1997, when Zuckerberg  was in middle school, Jobs was just beginning his second stint at Apple, and Bieber was 3.  A few things have changed since then.)  If you don’t believe me, how about a friendly wager?  Call up the career services of a few major universities.  Ask them if an undergraduate majoring in computer science with a minor in economics has a better chance of getting a job than an undergraduate majoring in philosophy with a minor in English literature.  I’ll bet any amount of $$$ that they pick the former.

Using GDELT to forecast violence in Afghanistan

The Global Dataset of Events, Location, and Tone (GDELT) is a new, 230 million (and growing daily) is the first ever machine-coded political event data dataset to provide information on event location.  For those attending ISA, Kalev Leetaru and Phil Schrodt will be formally introducing the GDELT dataset.  The full dataset will be publicly available soon, but for now you can access an older version here.

From a forecasting perspective, the benefits of a machine-coded dataset updated in (near) real-time that provides specific latitude-longitude coordinates are numerous.  In the first ever empirical analysis using GDELT (pdf of paper –> “Predicting Future Levels of Violence in Afghanistan Districts with GDELT“), I build an empirical model that predicts the level of conflict at the district-month level in Afghanistan.  Below is .gif that Joshua Stevens built using GDELT that reflects the distribution of conflict events in Afghanistan over time.

Moore’s Law and Event Data

In 1965, Gordon Moore predicted that the number of transistors on integrated circuits would double every 2 years.  By 1970, the term “Moore’s Law” was coined.  Since then, Moore’s Law has proven shockingly accurate in  not only its intended domain (transistors on chips) but across a number of other areas, such as hard disk storage and pixels (see Wikipedia and Microsoft’s take).  Recently, I found that Moore’s Law is also applicable to the number of observations in the largest political event data datasets.  Below are the key milestones in political event data.  Note that the size of WEIS is a general estimation since the original dataset no longer exists.

  • 1978 – World Event Interaction Survey (WEIS): 2,000 observations
  • 1996 – Kansas Event Data Set (KEDS): 225,000 observations
  • 2004 – 10 Million International Dyadic Event dataset: 10,000,000 observations
  • 2012 – Global Database of Events, Location, and Tone (GDELT): 220,000,000 observations
Below, these true values are graphed against what Moore’s Law would predict, knowing only that that largest dataset in 1978 was ~1,800 observations. For visual appeal, I plot using a logarithmic scale.

Although the accuracy of Moore’s law in the above graph is incredible, what is even more interesting is what this suggests about the future.  If the trend holds, the largest political event data dataset should be approaching 1 billion observations around 2016/2017.

A (slightly) better way to evaluate out-of-sample performance on TSCS data

The international relations conflict literature is dominated by time-series cross sections datasets with a binary dependent variable (henceforth referred to as BTSCS).   The majority of studies using BTSCS data use a logit/probit model to draw inferences on one or a few “key” variables of interest or compare models.   Although the most empirically justified way to actually test which model is better or the importance of a certain “key” variable through evaluating out-of-sample performance, alarmingly few scholars using BTSCS data actually do.  Beck et. al. (2000) is a primary exception and current “best practice” approach.  Here is a quick recap of their approach:

Using the Tucker (1997) dataset (BTSCS at the dyad-year from 1947 to 1989), Beck et. al. set 1960-1985 as the in-sample data and use 1986-1989 as the out-of-sample data to evaluative model performance.  This is not a bad approach and is infinitely better than purely relying on in-sample metrics.  However, this approach means that predictions for 1989 are based on a training model that does not include data from 1986, 1987, and 1988.  Thus, potentially valuable data is unnecessarily omitted.

With this in mind, I suggest an alternative “rolling” approach, which iteratively expands the training set by one temporal unit in a way that is conceptually similar to the idea of “online learning” in the machine learning world.  The benefits of the rolling approach are largely two-fold.  First, unlike Beck et. al’s method, the rolling model uses as much data as possible for each forecast.  Thus, whereas the Beck approach attempts to generate forecasts for 1989 using training data from just 1947 to 1985, my approach incorporates information from 1986, 1987, and 1988.    Second, since performance scores are calculated for each year rather than just on one chunk of years (i.e. for 1986, 1987, 1988, and 1999 individually, and opposed to just 1986-1999 cumulatively), researchers have more information to use when comparing models.

To facilitate discussion, assume we have a TSCS dataset at the state-year level with 100 years of data.  Below is the improved, “rolling” approach:

  1. Sort the TSCS data ascending by time.
  2. Generate count variable for each unique temporal unit.
  3. Set the first 75 observations as the training set and the 76th observation as the test set. Depending on the characteristics of your dataset, you may wish to start use anywhere from 50-80% of the data for initial training, instead of the recommended 75%.
  4. Estimate and store coefficients using the training set that maximize the likelihood function.
  5. Apply these coefficients to the variables from the 76th year to generate predicted probabilities and store measures of predictive accuracy.
  6. Repeat steps 3, 4, and 5 by yearly increments until we reach the 100th observation. i.e. retrain the model using the first 76 observations, store coefficients, then generate predicted probabilities for the 77th observation.

Below is some simple code in R that intuitively outlines how this works.  The code calculates and plots the out-of-sample AUC scores for the Grievance and Opportunity models in of Collier and Hoeffler (2004) (click here for the data):

The plow above shows that the “rolling” approach generates an AUC score for each time period, as opposed to a single, cumulative score for 1980 to 1995.  In this example, neither the Grievance nor Opportunity model appears to outperform the other, since their AUC scores overlap.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
##code for 1 unit ahead model evaluation of BTSCS data
##James E. Yonamine
##data from
##run time: 12 seconds
data <- read.dta("collier and hoeffler 2004.dta")
data<-data[complete.cases(data$warsa),] #this drops all rows with NA for warsa, i.e. drops all continuations. C&H use this approach, which is reasonable
a.out <- amelia(data, m = 1, ts = "year", cs = "country")
data<-cbind(data, data.frame(count=cumsum(c(TRUE,data$year[-1]!=data$year[-length(data$year)]))))
num<-data[nrow(data),ncol(data)] ##figure out how many unique time series you have
start = 4 ##this determine what portion of the data to use as the initial training set
step=1 ##how many steps ahead do you want to forecast
history <- matrix(data=0, nrow=(num-start), ncol=3)
colnames(history) <- c("year", "griev_1.auc", "opp_1.auc")
for (i in start:(num-1)) {
set.seed(i)<- data[which(data[,ncol(data)] <= i),] #set initial test set <- data[which(data[,ncol(data)] ==(i+step)),] #set initial trianing set to be one period ahead
grievance.1 <- glm(warsa ~ elfo + rf + pol16 + etdo4590 + dem + peace + mount + geogia + lnpop, family=binomial(link="logit"), data =
grievance.1.predict <- predict(grievance.1,, type="response")
## Establish prediction objects from ROCR package
with.predict <-prediction(grievance.1.y.hat,y)
## Calculate and store the AUC
with.auc <- performance(with.predict, measure = "auc")
history[(i-start)+1,2] <- as.numeric(unlist(slot(with.auc,"y.values")))
opportunity.1 <- glm(warsa ~sxp + sxp2 + coldwar + secm + gy1 + peace + prevwara + mount + geogia + frac + lnpop, family=binomial(link="logit"), data =
opportunity.1.predict <- predict(opportunity.1,, type="response")
## Establish prediction objects from ROCR package
without.predict <-prediction(opportunity.1.y.hat,y)
## Calculate and store the AUC
without.auc <- performance(without.predict, measure = "auc")
history[(i-start)+1,3] <- as.numeric(unlist(slot(without.auc,"y.values")))
### store the temporal component
history[(i-start)+1,1] <[1,4]
#####plot it####
yrange=range(ymin, ymax)
plot(history[,2], type="o", col="blue", xaxt="n", ylim=yrange, xlab="Year", ylab="AUC", main="Example Plot")
lines(history[,3], type="o", col="red")
axis(1, 1:nrow(history), lab=history[1:4,1])
legend(1, c("Grievance_1","Opportunity_1"),
col=c("blue","red"), pch=21, lty=1)
view raw gistfile1.r hosted with ❤ by GitHub

What can Kimbo Slice teach us about predictive models?

In the backyards of Miami, my money is on Bueno de Mesquita, every time.

I have tried very hard to keep my website/blog/twitter focused solely on my professional interests – using open source data to forecast socially driven outcomes.  Now, I’m going to bend the rules a bit and bring in my love of combat sports in order to apply lessons from mixed martial arts (MMA, or “cage fighting” or “ultimate fighting” to the layperson) to forecasting models.

Last Saturday, Fox aired “UFC on Fox 5”, which was undoubtedly the greatest collection of MMA talent ever aired on a free TV.  The main event – Nate Diaz vs. Benson Henderson – garnered 5.7 million views.  Not bad, but consider this: Kimbo Slice has fought 3 times on free TV, each time surpassing 6 million views (6.1, 6.45, and 7.1 million to be exact).  Until a string of losses that exposed him as a D level fighter, Kimbo was among the biggest draws in all of combat sports.  But why? I believe that it derives from our obsession with the mythical.  In the context of fighting, we seem to be drawn to someone who, for untraditional reasons, seems to be invincible.  In terms of MMA, the fighters who achieve mythical status share two things in common: 1) untraditional or secret training methods and 2) a string of dominant victories over easy opponents.  As I argue a bit later, the same is true of predictive models

In the early stages of MMA, the Gracie family dominated.  For many years, their brand of jiu-jitsu (submission fighting on the ground) allowed physically inferior men like Royce to appear invincible in actual no holds barred fighting.  Their highly technical method of ground fighting seemed like magic to the untrained eye, and since virtually no one in the United States was familiar with jiu-jitsu at the time, it just seemed like magic.  Eventually, the Gracie’s began to fight better competition; Kazushi Sakuraba tarnished the Gracie legacy by beating Royce, Renzo, Royler, and Ryan, and Matt Hughes drove the final nail in the coffin when he mauled Royce.

The Russian fighter Fedor Emelianenko regained and arguably surpassed a Gracie-level of mystique.  Like the Gracie’s, the specific details of his training regimen were not know outside of Russia, and the only glimpses of his training U.S. audiences saw were Youtube clips of minimalist workouts on Russian playgrounds.  He was (and likely still is) a secretive man who seemingly cared more about his religious devotion than fighting (his Orthodox priest often accompanied him to fights).  Also, like the Gracies, Emelianenko gained mythical status by beating a mix of legitimate, semi-legitimate, and absurdly illegitimate competition.

Kimbo Slice achieved event greater U.S. fame than the Gracie’s or Emelianenko through a series of backyard street fights in Miami that circulated on YouTube.  He destroyed lesser competition (i.e. people with no idea how to fight) in dramatic fashion (i.e. he once popped out someone’s eyeball. Seriously.).  No one really knew his fighting background or training, but people seemed to assume that he was so innately tough that it didn’t much matter.  By the time he fought James Thomson live on CBS in 2008, he had reached mythical status.  Of course, this came crashing down when he was knocked out in 14 seconds by unranked journeyman Seth Petruzelli.

So what does this all have to do with forecasting models?  As with MMA fighters, we seem to be intuitively drawn to black box, secret models.  Additionally, like in the fight game, models can artificially inflate their reputation by “beating up” on easy problems.

Like with out fighters, we seem to be intuitively drawn towards secret, mythical approaches to prediction.  Just consider the absurdity of the history of human forecasting: Oracles interpreting hallucinogenic dreams, astronomers staring at the heavens, monks pouring over ancient texts, etc..  My sense is that many still seem to think that a secret formula exists that can predict the world.  Many practitioners, like Bruce Bueno de Mesquita or the CIA, certainly help to propagate this myth (see the History Channel special that pitted Nostradamus vs. BDM).  I suspect if you surveyed 1,000 random people about whether academics using open source data and open-source statistical packages could build better forecasts of civil unrest than the CIA using only classified data and proprietary algorithms, most would pick the CIA. I’d bet anything on the open source team.  In predictive models as in fighting, there is no secret approach to success.

Additionally, similar to the way in which fighters achieve mythical status by often padding their records fighting easy competition, predictive models have a few tricks to “pad” their records as well.  First, and a favorite of the game theorists, is to use a model to make non-falsifiable predictions.  Thus, no matter what happens, the game theorists can make a strong case that his or her model correctly predicted it after the fact.  Second, the primary tool of empirically driven prediction models is to inflate results by predicting that tomorrow will look a lot like today.  For outcomes like whether or not a country will be at civil war tomorrow, this approach is correct 90+% of the time.

In reality, neither mythical fighters nor mythical predictive models exist.  Everyone knows exactly how the best fighters — Jon Jones, Anderson Silva, Georges St. Pierre — train.  These fighters get to that level by having slightly more genetic gifts, work slightly harder, and having slightly better coaches, all of which contribute to a minor advantage over their opponents come fight time.  This is almost identical to the predictive model world.  In actual, objective tests of predictive accuracy (such as Kaggle competitions), the wining models tend to only narrowly beat the competition and the methodological approaches tend to be highly similar.  The reality of predictive models is just like that of fighting: there are no mythical approaches. The winning teams tend to put in slightly more time on slightly more powerful machines with teams comprised of slightly more experienced modelers.

With fighters as with predictive models, if it seems to good to be true, it almost certainly is.


3 things to pay attention to when analyzing predictive accuracy

{{insert obligatory Nate Silver reference to connect the content of this post to current events}}

As is readily obvious from the content of this blog, I think and write frequently about predictions, and I constantly advocate that the only way to test whether a person or model (of the non-person type, unfortunately) actually helps us better “understand” the world is to evaluate how well he/she/it can predict. Despite the emphasis of predictive accuracy, there are a number of complications to keep in mind when evaluating a model’s predictive accuracy that tend to be overlooked.  Here are three that I believe to be particularly important:

First, and perhaps most importantly, new models of the world that ultimately prove correct occasionally generate weaker initial predictive accuracy than long established models that ultimately prove poor reflections of reality.  Consider the debate between proponents of the geocentric vs. heliocentric universe.  In western scientific history, geocentric models predated heliocentric alternatives, giving their (geocentric) proponents more time for model refinement.  This meant that geocentric models occasionally generated more accurate predictions than newer heliocentric models, even though we now know with certainty that the sun is the center of our solar system.  Thus, scientific progress often requires one step backwards in terms of predictive accuracy to ultimate take many steps forward (I borrow this point from Manzi).

Second, one of the major problems with reliance on in-sample testing is the possibility of overfitting: it is impossible to know if purported statistically significant covariates are simply fitting error or a true relationship.  Especially in empirical studies of conflict, inferences drawn from purely in-sample models simply cannot be trusted.  Rather, the gold standard for evaluating a model’s performance is out-of-sample (but don’t take it from me, see Beck, King, and Zheng 2000). In a perfect world, this would entail training a model on a dataset, then making predictions as we collect new data in real-time.  In practice, this can be difficult so we simulate this process by separating our data into an in-sample (often called training) and out-of-sample (often called validation or test) set.  It is critical to note that it is still possible to overfit a model using this out-of-sample set up.  This can occur when researchers do the following:

  1. train a model on the in-sample data
  2. evaluate its predictive performance on the out-of-sample set
  3. change the original models based on the results
  4. repeat this process until results become satisfactory

There are many techniques to avoid overfitting in an out-of-sample framework.  One of the most straightforward and commonly employed is to divide a dataset into three parts: training, validation, test.  To avoid artificially enhancing predictive through overfitting, simply add a fifth step:

5. pick a single model and tests its predictive accuracy on the test set

It is critical that the test set must only be used one time.  Note that various other iterative sampling approaches also exist to avoid overfitting.

Third, it is always possible that a model’s high degree of predictive accuracy is due entirely to luck.  To crudely illustrate this, I borrow an example that Warren Buffett to illustrate the role of luck in stock picking.  Imagine there was a NCAA basketball style, rock-paper-scissors single elimination tournament with ~1 billion people, with the winner getting paid $1million.  As each round progressed, the excitement would build.  After round 28, we would be down to 4 individuals.  By this point, they would be making the talk show rounds, likely describing their secret formula for success.  MSNBC would eat it up. Finally, a winner emerges, he/she probably lands a book deal with a title like “Get Rich Now: How my approach to paper rock scissors can make you millions”.  Meanwhile, the entire thing was luck.

With the number of people building predictive models (especially in finance), odds are than some models will achieve spectacular runs of success based entirely on luck.  Others may achieve success because their model is actually superior.  Differentiating between a dominant model and pure luck can be very difficult, especially when the outcome-of-interest is observed infrequently (like in elections).