Faults in Statistical Analysis and tPP’s Solutions


This continues our series of student reflections and analysis authored by our research team.


Continuing with the theme of the correlation we might find between attack lethality (i.e., the number of fatalities recorded from a terrorist incident) and affiliation with an FTO, there are several problems we may encounter as a team when running linear regressions of the variable “Number Killed” on any other variable. The tPP dataset, for one, codes terrorist incidents on an individual basis rather than an event basis. Because of this, when running a linear regression on “Number Killed” and “Affiliation with FTO”, for example, the scatterplot will include a separate data point for each individual. This is problematic because in cases that involve multiple perpetrators, fatalities will be counted repeatedly, once for each perpetrator who carried out the attack. Take, for example, the case in my previous blog post, which involved six individuals who called themselves “The Family” and carried out attacks affiliated with the Animal Liberation Front (ALF) and the Earth Liberation Front (ELF). In the arson attack on the BLM Wild Horse Corrals in Litchfield, California, all six individuals took part in the attack, whether by organizing it or by perpetrating it directly. Although no deaths resulted from the incident, each of the six individuals who appear in our dataset is assigned a value of 0 for the “Number Killed” variable. When any statistical software plots this attack on a graph, the zero deaths that resulted from it will be counted six separate times, essentially as six different incidents. In other words, any regression that runs the variable “Number Killed” against another variable will be skewed and inaccurate.
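To make the repeated-counting problem concrete, here is a minimal sketch in Python. The rows are invented for illustration and are not actual tPP records: the field names (`incident_id`, `killed`) and the second incident are hypothetical, and only the six-perpetrator, zero-fatality structure mirrors the Litchfield example.

```python
from statistics import mean

# Hypothetical per-individual rows: one row per perpetrator, as in tPP.
# Incident "A" mirrors the Litchfield arson (six perpetrators, zero deaths).
# Incident "B" is an invented single-perpetrator attack with three deaths.
person_rows = (
    [{"incident_id": "A", "killed": 0} for _ in range(6)]
    + [{"incident_id": "B", "killed": 3}]
)

# Treating every row as its own observation gives seven data points.
per_person_mean = mean(row["killed"] for row in person_rows)

# Treating every incident as one observation gives two data points.
killed_by_incident = {row["incident_id"]: row["killed"] for row in person_rows}
per_incident_mean = mean(killed_by_incident.values())

print(f"observations per person row: {len(person_rows)}")
print(f"observations per incident:   {len(killed_by_incident)}")
print(f"mean killed per person row:  {per_person_mean:.2f}")
print(f"mean killed per incident:    {per_incident_mean:.2f}")
```

In this toy data the single zero-fatality incident contributes six of the seven observations, so it pulls the per-row average well below the per-incident average. The same weighting would carry over into any regression fitted to the per-individual rows.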

In addition to the repeated-counting issue we see in regressions that include the “Number Killed” variable, we also see a significant number of extreme outliers among the yes-valued entries under “Affiliation with FTO” when running a regression between the two variables (I ran this regression in my preliminary analysis of the correlation between attack lethality and FTO affiliation). Although we also see a substantial number of outliers among the no-valued entries under “Affiliation with FTO”, most of these outliers have much lower values than the yes-valued data points. Take, for example, the figure below. In this figure, we would not expect the relationship between “Number Killed” and “Affiliation with FTO” to be linear; rather, we would expect it to be exponential. Due to the skewed nature of “Affiliation with FTO”, a linear regression, again, would not accurately capture the relationship between the two variables. This is problematic because the linear equation we would obtain from running a linear regression on the scatterplot below would not give meaningful results, and our analysis would be distorted. In the preliminary analysis we ran on these two variables, we concluded that there is a significant positive relationship between “Number Killed” and “Affiliation with FTO”, but because of two major issues, (1) the nature of the dataset and (2) the nature of the “Affiliation with FTO” variable, our analysis is unreliable.

To resolve the issue of repeated counting in our dataset, we are in the process of compiling the entries in tPP into a secondary dataset that will account for all of the perpetrators in the dataset but will record each entry on a per-incident basis. In other words, this new dataset will eliminate the counting errors experienced in our original dataset. The new dataset will allow members of tPP to run regressions on the “Number Killed” and “Number Injured” variables with other variables in our dataset and obtain accurate results. As tPP is approaching its fifth semester in existence, a separate “analysis” course has been created for team members to draw constructive and meaningful results from our data. The new dataset will be crucial in furthering students’ quantitative analysis of our data.
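One way such a per-incident dataset could be built is sketched below in Python. This is an illustration under stated assumptions, not the procedure tPP actually uses: the field names (`incident_id`, `killed`, `injured`, `fto`) are hypothetical, and so is the rule that an incident counts as FTO-affiliated if any of its perpetrators is.

```python
from collections import defaultdict


def collapse_to_incidents(person_rows):
    """Collapse per-individual rows into one row per incident."""
    grouped = defaultdict(list)
    for row in person_rows:
        grouped[row["incident_id"]].append(row)

    incidents = []
    for incident_id, rows in grouped.items():
        incidents.append({
            "incident_id": incident_id,
            # Perpetrators are still accounted for, but as a count.
            "n_perpetrators": len(rows),
            # Casualty figures are repeated on every perpetrator's row,
            # so take the value once (via max) instead of summing it.
            "killed": max(r["killed"] for r in rows),
            "injured": max(r["injured"] for r in rows),
            # Assumption: affiliated if any perpetrator is affiliated.
            "fto": any(r["fto"] for r in rows),
        })
    return incidents


# Hypothetical rows: six perpetrators of one zero-fatality incident,
# plus one invented single-perpetrator incident.
person_rows = [
    {"incident_id": "A", "killed": 0, "injured": 0, "fto": False}
    for _ in range(6)
] + [{"incident_id": "B", "killed": 3, "injured": 5, "fto": True}]

incidents = collapse_to_incidents(person_rows)
for incident in incidents:
    print(incident)
```

Taking the casualty value once per incident, rather than summing across rows, is the key design choice: summing would reproduce the same inflation in a different form.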

Then, to resolve the issue of skewness in the above scatterplot, and in my existing regression, the variable “Number Killed” will need to be log-transformed before it is regressed on “Affiliation with FTO”.

This equation will come closer to producing a realistic relationship between “Number Killed” and “Affiliation with FTO”.
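As a rough sketch of how a log-transformed regression of this kind might be fitted, the Python below assumes the form log(1 + Number Killed) = b0 + b1 * FTO, where FTO is 1 for affiliated incidents and 0 otherwise. Adding 1 before taking the log is my assumption, made because many incidents have zero deaths and log(0) is undefined. The data and field names are invented for illustration.

```python
import math
from statistics import mean


def fit_log_killed_on_fto(incidents):
    """Least-squares fit of log(1 + killed) on a 0/1 FTO indicator.

    With a single binary predictor, the least-squares intercept is the
    mean outcome of the fto == 0 group, and the slope is the difference
    between the two group means, so no matrix algebra is needed.
    Both groups must contain at least one incident.
    """
    log_no = [math.log1p(r["killed"]) for r in incidents if not r["fto"]]
    log_yes = [math.log1p(r["killed"]) for r in incidents if r["fto"]]
    b0 = mean(log_no)
    b1 = mean(log_yes) - b0
    return b0, b1


# Hypothetical per-incident rows, including one extreme outlier.
incidents = [
    {"fto": False, "killed": 0},
    {"fto": False, "killed": 1},
    {"fto": True, "killed": 0},
    {"fto": True, "killed": 200},
]

b0, b1 = fit_log_killed_on_fto(incidents)
print(f"intercept b0 = {b0:.3f}")
print(f"slope     b1 = {b1:.3f}")
```

On the log scale the 200-fatality incident sits at about 5.3 rather than 200, so it no longer dominates the fit. The slope b1 is read as a difference in average log(1 + killed) between affiliated and unaffiliated incidents, not as a difference in raw fatality counts.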

As tPP moves forward, it is our goal to always analyze our data in an accurate and ethical manner. The problems that we have encountered thus far are in the process of being resolved. We will continue to resolve any issues we notice along the way.

  • Meg