Confidence intervals for random forest partial plots

Everyone loves random forests (RFs).  The algorithm is powerful, intuitive, and easily implemented in most languages and statistical platforms.  However, a common critique is that it is difficult to extract substantive effects from a trained random forest.  This is especially true amongst the academic crowd, who are used to coefficients and magical ***’s.  Fortunately, R’s randomForest package and Python’s sklearn both provide “partial plot” methods, which demonstrate the substantive effect of an independent variable (IV) on the dependent variable (DV).  But there are two key issues with the resulting plots. First, they are ugly.  Second, they do not provide confidence intervals (CIs), which makes interpretation difficult relative to GLM-based marginal effects plots.

Taken directly from the randomForest partialPlot documentation: “The function being plotted is defined as (the function below), where x is the variable for which partial dependence is sought and x_{iC} is the other variables in the data.”

\widetilde{f}(x)=\frac{1}{n}\sum\limits_{i=1}^{n} f(x, x_{iC})

Below is a crude representation of how the partialPlot function works.  Consider the data below (y is the dependent variable; var_1, var_2, and var_3 are independent variables):

 y   var_1   var_2   var_3
12       4       7       1
18       6       1       4
20      18       4       2
14      16       4       7
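
To make the mechanics concrete, here is a minimal sketch of the averaging that partialPlot performs.  This is an illustration, not the package’s actual internals; the object names df and rf.fit are mine, and the toy data above are far too small for a meaningful fit:

library(randomForest)

df <- data.frame(y     = c(12, 18, 20, 14),
                 var_1 = c(4, 6, 18, 16),
                 var_2 = c(7, 1, 4, 4),
                 var_3 = c(1, 4, 2, 7))
rf.fit <- randomForest(y ~ ., data = df, ntree = 50)

# For each value of var_1, set var_1 to that value for ALL rows,
# predict with the trained forest, and average the predictions.
grid <- sort(unique(df$var_1))
pd <- sapply(grid, function(v) {
  temp <- df
  temp$var_1 <- v
  mean(predict(rf.fit, temp))
})
plot(grid, pd, type = "b", xlab = "var_1", ylab = "partial dependence")

The averaged prediction at each grid value is exactly the \widetilde{f} in the formula above, which is why a single curve with no confidence bands comes out the other end.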

Data and code to build district-month predictions of future violence in Afghanistan

Earlier this week, I posted an article about building predictions of future levels of violence at the district-month level for Afghanistan.  Here is the data, and the code is below.  Note that it takes about 30 minutes to run.  Also note that it generates LOTS of errors, since many of the time series have long runs of 0’s; these eventually get predicted to be 0’s, so it’s all good.  Let me know if you would like the province- or country-level data or code.  Efficiency wasn’t really my goal with this code (it’s crude and clunky but achieves good predictive accuracy); the loop structure is slow but should be pretty easy to follow.  Please let me know if anything doesn’t make sense.

##programmer: JEY
##2/28/2013
##District-month forecasts
##note: this takes about 30 minutes to run. It throws lots of errors due to many all-zero time series, but it will eventually complete and build predictions

rm(list = ls())
library(foreign)   # read.dta
library(forecast)  # arfima, forecast

data <- read.dta("raw_AFG.dta")

start <- 580  # first month to forecast
end   <- 627  # last month to forecast
error.naive  <- numeric(end - start + 1)
error.arfima <- numeric(end - start + 1)

for (k in start:end) {
  month <- k
  error <- matrix(data = 0, nrow = 317, ncol = 3)
  colnames(error) <- c("true", "naive", "arfima")
  for (i in 1:317) {  # loop over the 317 districts
    test <- subset(data, newid == i & monthly < month)
    a <- test$count
    unif <- runif(length(a), min = 0, max = .1)
    a <- a + unif  # this minor addition allows the arfima to converge
    fit <- arfima(a)
    y_hat <- forecast(fit, h = 1)  # one-step-ahead forecast
    y <- as.vector(y_hat$mean)
    error[i, 3] <- y
    true <- subset(data, newid == i & monthly == month)  # actual level of violence
    error[i, 1] <- true$count
    naive <- subset(data, newid == i & monthly == (month - 1))  # naive prediction: the value at t-1
    error[i, 2] <- naive$count
  }
  error <- round(error, digits = 0)
  # mean absolute error across the 317 districts for this month
  error.naive[k - (start - 1)]  <- sum(abs(error[, 2] - error[, 1])) / 317
  error.arfima[k - (start - 1)] <- sum(abs(error[, 3] - error[, 1])) / 317
}

error.naive
error.arfima
compare <- cbind(error.naive, error.arfima)
dif <- as.matrix(error.naive - error.arfima)
results <- cbind(compare, dif)
colnames(results) <- c("error.naive", "error.arfima", "difference")
write.csv(results, "results_district.csv")
 
 

The Effects of Intra-state Conflict on Interstate Conflict: An Analysis of GDELT

The release of the GDELT dataset has been receiving a lot of attention, and rightfully so (see Foreign Policy’s write-up here, Jay Ulfelder’s write-up here, and Phil Schrodt and Kalev Leetaru’s official release paper here).  Last week, I posted a chapter of my dissertation that used GDELT to build predictions of violence in Afghanistan (see the posts below).  Below is a chapter I wrote that uses GDELT to provide a rigorous, dyad-month-level analysis of the effects of domestic conflict on interstate conflict.

The Effects of Domestic Conflict on Interstate Conflict: An Event Data Analysis of Monthly Level Onset and Intensity 

A Tale of Two Ph.D.s

On Friday, Slate published an article called “Thesis Hatement: Getting a Literature Ph.D. will turn you into an emotional trainwreck, not a professor”, written by Rebecca Schuman.  Since I don’t know Schuman or anything about a humanities Ph.D. program, I’ll withhold judgments that I would otherwise be inclined to make.  I just figured I’d share a bit about my experience getting a Ph.D., since it is pretty much the exact opposite of Schuman’s.

I began the Ph.D. program in the department of political science at Penn State in the fall of 2009 with no prior graduate schooling.  Penn State, like most major universities, covers all tuition for students accepted into the Ph.D. program.  It also provides a stipend that covers the basic cost of living. Early on in my first semester, I told the faculty that my intention was to finish the Ph.D. and then work for either the government or enter the private sector.  Reactions ranged from neutral to highly supportive.

Last month, I completed my Ph.D., a little over three and a half years after I started.  During that time, I took nearly a dozen methodological courses, four of which were outside the department but still fully funded.  I taught an undergraduate class, did outside consulting work for the government, presented at conferences, and got a few publications.  I also wasted a hell of a lot of time, and not in Schuman’s “grad school is a waste of time” sense, but in the “going to bars to play pool and drink $5 pitchers of Bud Light” sense, meaning that I certainly could have been more productive.

The training I received as a Ph.D. student qualified me for a host of jobs, ranging from think tanks to government to academia to the private sector.  In January, I accepted a job as a data analyst with Allstate Insurance and started in March.

I must acknowledge that I was lucky, since my department gave me (and the other students in the program) considerable freedom and support to pursue whatever interests we had.  I also had an incredibly good advisor.  I have no idea if my experience is indicative of other social science Ph.D. programs and am not claiming that it is.  All I know is that for me, the process was rewarding, fairly fast, and led directly to a good job.

Ok, so I lied — I will pass judgment on one point Schuman makes.  She claims that a humanities BA is among the most hireable, which is just flat-out wrong (note that the supporting article was written in 1997, when Zuckerberg was in middle school, Jobs was just beginning his second stint at Apple, and Bieber was 3.  A few things have changed since then.)  If you don’t believe me, how about a friendly wager?  Call up the career services offices of a few major universities.  Ask them whether an undergraduate majoring in computer science with a minor in economics has a better chance of getting a job than an undergraduate majoring in philosophy with a minor in English literature.  I’ll bet any amount of $$$ that they pick the former.

Using GDELT to forecast violence in Afghanistan

The Global Dataset of Events, Location, and Tone (GDELT) is a new, 230-million-observation (and growing daily) dataset, and the first machine-coded political event dataset to provide information on event location.  For those attending ISA, Kalev Leetaru and Phil Schrodt will be formally introducing the GDELT dataset.  The full dataset will be publicly available soon, but for now you can access an older version here.

From a forecasting perspective, the benefits of a machine-coded dataset updated in (near) real-time that provides specific latitude-longitude coordinates are numerous.  In the first empirical analysis using GDELT (pdf of paper –> “Predicting Future Levels of Violence in Afghanistan Districts with GDELT“), I build a model that predicts the level of conflict at the district-month level in Afghanistan.  Below is a .gif that Joshua Stevens built using GDELT that reflects the distribution of conflict events in Afghanistan over time.

Moore’s Law and Event Data

In 1965, Gordon Moore predicted that the number of transistors on integrated circuits would double every 2 years.  By 1970, the term “Moore’s Law” had been coined.  Since then, Moore’s Law has proven shockingly accurate not only in its intended domain (transistors on chips) but across a number of other areas, such as hard disk storage and pixels (see Wikipedia and Microsoft’s take).  Recently, I found that Moore’s Law is also applicable to the number of observations in the largest political event datasets.  Below are the key milestones in political event data.  Note that the size of WEIS is a general estimation, since the original dataset no longer exists.

  • 1978 – World Event Interaction Survey (WEIS): 2,000 observations
  • 1996 – Kansas Event Data Set (KEDS): 225,000 observations
  • 2004 – 10 Million International Dyadic Event dataset: 10,000,000 observations
  • 2012 – Global Database of Events, Location, and Tone (GDELT): 220,000,000 observations
Below, these true values are graphed against what Moore’s Law would predict, knowing only that the largest dataset in 1978 was ~1,800 observations. For visual appeal, I plot using a logarithmic scale.
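
For the curious, here is a small sketch of how that comparison can be reproduced in R.  The only input to the Moore’s Law line is the ~1,800-observation seed in 1978; the actual sizes are taken from the list above:

years  <- c(1978, 1996, 2004, 2012)
actual <- c(2000, 225000, 10000000, 220000000)
moore  <- 1800 * 2^((years - 1978) / 2)  # doubling every 2 years from ~1,800 in 1978
plot(years, actual, log = "y", type = "b", pch = 16,
     xlab = "year", ylab = "observations (log scale)")
lines(years, moore, type = "b", lty = 2)
legend("topleft", legend = c("actual", "Moore's Law"),
       pch = c(16, NA), lty = c(1, 2))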

Although the accuracy of Moore’s Law in the above graph is incredible, what is even more interesting is what this suggests about the future.  If the trend holds, the largest political event dataset should be approaching 1 billion observations around 2016/2017.

A (slightly) better way to evaluate out-of-sample performance on TSCS data

The international relations conflict literature is dominated by time-series cross-section datasets with a binary dependent variable (henceforth referred to as BTSCS).  The majority of studies using BTSCS data use a logit/probit model to draw inferences about one or a few “key” variables of interest or to compare models.  Although the most empirically justified way to test which model is better, or how important a certain “key” variable is, is to evaluate out-of-sample performance, alarmingly few scholars using BTSCS data actually do so.  Beck et al. (2000) is a primary exception and the current “best practice” approach.  Here is a quick recap of their approach:

Using the Tucker (1997) dataset (BTSCS at the dyad-year level from 1947 to 1989), Beck et al. set 1947-1985 as the in-sample data and use 1986-1989 as the out-of-sample data to evaluate model performance.  This is not a bad approach, and it is infinitely better than relying purely on in-sample metrics.  However, it means that predictions for 1989 are based on a training model that does not include data from 1986, 1987, and 1988.  Thus, potentially valuable data is unnecessarily omitted.

With this in mind, I suggest an alternative “rolling” approach, which iteratively expands the training set by one temporal unit in a way that is conceptually similar to the idea of “online learning” in the machine learning world.  The benefits of the rolling approach are largely two-fold.  First, unlike Beck et al.’s method, the rolling model uses as much data as possible for each forecast.  Thus, whereas the Beck approach generates forecasts for 1989 using training data from just 1947 to 1985, my approach incorporates information from 1986, 1987, and 1988.  Second, since performance scores are calculated for each year rather than for one chunk of years (i.e., for 1986, 1987, 1988, and 1989 individually, as opposed to 1986-1989 cumulatively), researchers have more information to use when comparing models, as the sketch below illustrates.
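
Here is a minimal sketch of the rolling scheme.  The data frame df and the covariates x1 and x2 are hypothetical placeholders, and accuracy stands in for whatever performance metric is preferred:

years.out <- 1986:1989
acc <- numeric(length(years.out))
for (j in seq_along(years.out)) {
  yr <- years.out[j]
  train <- subset(df, year < yr)   # training set expands by one year each iteration
  test  <- subset(df, year == yr)  # forecast target: the next year only
  fit <- glm(y ~ x1 + x2, data = train, family = binomial)
  p <- predict(fit, newdata = test, type = "response")
  acc[j] <- mean((p > 0.5) == test$y)  # one score per out-of-sample year
}
acc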

What can Kimbo Slice teach us about predictive models?

In the backyards of Miami, my money is on Bueno de Mesquita, every time.

I have tried very hard to keep my website/blog/twitter focused solely on my professional interests – using open source data to forecast socially driven outcomes.  Now, I’m going to bend the rules a bit and bring in my love of combat sports in order to apply lessons from mixed martial arts (MMA, or “cage fighting” or “ultimate fighting” to the layperson) to forecasting models.

Last Saturday, Fox aired “UFC on Fox 5”, which was undoubtedly the greatest collection of MMA talent ever aired on free TV.  The main event – Nate Diaz vs. Benson Henderson – garnered 5.7 million viewers.  Not bad, but consider this: Kimbo Slice has fought 3 times on free TV, each time surpassing 6 million viewers (6.1, 6.45, and 7.1 million, to be exact).  Until a string of losses exposed him as a D-level fighter, Kimbo was among the biggest draws in all of combat sports.  But why? I believe that it derives from our obsession with the mythical.  In the context of fighting, we seem to be drawn to someone who, for untraditional reasons, seems to be invincible.  In terms of MMA, the fighters who achieve mythical status share two things in common: 1) untraditional or secret training methods and 2) a string of dominant victories over easy opponents.  As I argue a bit later, the same is true of predictive models.

3 things to pay attention to when analyzing predictive accuracy

{{insert obligatory Nate Silver reference to connect the content of this post to current events}}

As is readily obvious from the content of this blog, I think and write frequently about predictions, and I constantly advocate that the only way to test whether a person or model (of the non-person type, unfortunately) actually helps us better “understand” the world is to evaluate how well he/she/it can predict. Despite this emphasis on predictive accuracy, there are a number of complications that tend to be overlooked when evaluating a model’s predictive accuracy.  Here are three that I believe to be particularly important:

First, and perhaps most importantly, new models of the world that ultimately prove correct occasionally generate weaker initial predictive accuracy than long-established models that ultimately prove poor reflections of reality.  Consider the debate between proponents of the geocentric and heliocentric models of the universe.  In Western scientific history, geocentric models predated heliocentric alternatives, giving geocentric proponents more time for model refinement.  This meant that geocentric models occasionally generated more accurate predictions than newer heliocentric models, even though we now know with certainty that the sun is the center of our solar system.  Thus, scientific progress often requires one step backwards in terms of predictive accuracy to ultimately take many steps forward (I borrow this point from Manzi).

Second, one of the major problems with reliance on in-sample testing is the possibility of overfitting: it is impossible to know whether purportedly statistically significant covariates are simply fitting error or reflect a true relationship.  Especially in empirical studies of conflict, inferences drawn from purely in-sample models simply cannot be trusted.  Rather, the gold standard for evaluating a model’s performance is out-of-sample (but don’t take it from me; see Beck, King, and Zheng 2000). In a perfect world, this would entail training a model on a dataset, then making predictions as we collect new data in real time.  In practice, this can be difficult, so we simulate the process by separating our data into an in-sample (often called training) set and an out-of-sample (often called validation or test) set.  It is critical to note that it is still possible to overfit a model using this out-of-sample setup.  This can occur when researchers do the following: