One statistics lesson that Ed West does not want to receive

In the course of writing a forthcoming rant, I unearthed an old article in the Telegraph by someone called Ed West. It's called "The one inequality infographic no one on the Left wants to see". In short, it's crap.

The rest of this article will just expand on that short statement. While the article is of no apparent merit, a detailed rebuttal may be of interest to someone.

So here's the original, super-exciting image, the much-feared scourge of the entire Left, in all its amazing high-resolution glory:

Ed West's original graph

Starting at the beginning, it raises eyebrows to describe the thing as an "infographic" in the first place. Here is what you might call an infographic. Here is another infographic. Here is yet another goddamn infographic. The word that West is doing such a poor job of finding is "graph", or maybe "chart".

And it's a graph that doesn't explain itself particularly well. If you read the accompanying article, you discover the vital fact that the dots represent US states, and that it is supposed to prove that immigration and equality are incompatible. I may have been needlessly strict in my definition of "infographic" above, but we really can rule out graphs that can't even be interpreted without an accompanying Telegraph article.

There is then the obvious boo-boo that the Y-axis's lengthy and cumbersome label includes a date, but the X-axis doesn't. Are we to assume that those figures are the Gini coefficients measured at the same time, July 2008? Or from some other unspecified time? It would be nice to be told.

Reverse engineering

I decided to investigate the data a bit. Unfortunately, West has not been so sociable as to share his sources for easy analysis. I emailed him, and he sent me not the data, but loads of statistical conclusions which could be computed from that data in thirty seconds or so.

So for the rest of this article, I'll be taking his graph at face value, which will come to seem unacceptably generous by the time I'm done.

I did the best I could: I enlarged the graph severalfold, and (using Pinta) drew a single purple pixel in the centre of every point to produce a version for easy computerised analysis.

Then I produced a script (written in Python) to extract the datapoints. I then immediately performed a sanity check, using gnuplot to check that I could produce a fair copy of the original graph:

My copy of the graph, for analysis purposes
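The extraction step can be sketched roughly as follows. This is an illustration rather than the actual script: the marker colour, the tiny synthetic image, and the axis calibration values are all assumptions made up for the example, and the helper names `find_markers` and `make_scaler` are hypothetical.

```python
# Hypothetical marker colour; the real value depends on the edited image
# and is not given in the article.
MARKER = (128, 0, 128)  # assumed RGB of the hand-drawn purple pixel
WHITE = (255, 255, 255)


def find_markers(pixels, marker=MARKER):
    """Return (column, row) of every pixel whose colour equals `marker`.

    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    return [
        (col, row)
        for row, line in enumerate(pixels)
        for col, colour in enumerate(line)
        if colour == marker
    ]


def make_scaler(pixel_a, value_a, pixel_b, value_b):
    """Build a linear map from a pixel coordinate to a data value,
    calibrated on two known reference positions along one axis."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Tiny synthetic 4-row, 5-column "image" with two marked pixels.
image = [[WHITE] * 5 for _ in range(4)]
image[1][3] = MARKER
image[3][0] = MARKER

# Assumed calibration: column 0 is x = 0.40 and column 4 is x = 0.50;
# row 0 is y = 100 and row 3 is y = 40 (image rows grow downwards).
to_x = make_scaler(0, 0.40, 4, 0.50)
to_y = make_scaler(0, 100.0, 3, 40.0)

points = [(to_x(col), to_y(row)) for col, row in find_markers(image)]
print(points)
```

With a real image, the grid of pixels would be read from the file with an imaging library rather than built by hand, and the two calibration positions on each axis would be read off the tick marks.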

In doing this I discovered there were only forty-nine points visible in the graph (mysteriously, one fewer than the number of US states). A charitable explanation is that the graph was drawn incompetently, and two of the points were close enough to appear superimposed in it. A less charitable explanation is that one point was accidentally omitted. An even less charitable explanation is that one point was maliciously omitted. It is unfortunate that West has not been good enough to leave us a data trail, so that we could determine which.

Some analysis

The points of the graph don't tell much of a story, actually: any trend visible from looking at the points is rather weak. One observation is that the least unequal US state is the fifteenth-most ethnically diverse. That fact alone poses a serious challenge to West's thesis: that one example shows that it is possible to have both immigration and equality; we merely need to find out how it's done. Again, thanks to West's workshy attitude towards proper documentation, we can't go and find out the details.

What you're clearly supposed to see is the line of best fit, which is meant to represent the relationship between the two quantities: the big black line on the original, which has become green in my knock-off copy. It goes without saying that this line is not massively impressive: the data do not fit it at all closely.

In my copy, the red line has been calculated from the data, rather than copied over. While West (as I may have mentioned before) does little to explain the graph, the fact that my line coincides very closely with his means that they were quite likely produced by the same method: the popular ordinary least squares method of linear regression.
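For readers who haven't met it, ordinary least squares has a simple closed form. The sketch below is a generic illustration, not the code behind either graph; the function name `ols_line` is hypothetical and the four data points are made up. The detail worth noticing is that the method minimises errors in one variable only, so it matters which variable you put on which axis.

```python
def ols_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Minimises the sum of squared *vertical* distances, that is, the
    errors in y, treating x as the explanatory variable.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative points (not West's data) lying near y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.1, 6.9]
slope, intercept = ols_line(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```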

One would imagine that this technique was simple enough to be foolproof, but sometimes fools can be very creative. In this case, there has been a heinous logical error.

The model that has been used would be the correct one if there were reason to believe that immigration is a result of inequality. The graph is drawn in the usual way according to that working assumption, and (much more importantly) the line of best fit has been drawn according to that assumption. But that's the complete opposite of West's thesis: his working assumption has been that inequality is a result of immigration.

Let's redraw the graph accordingly: we'll not just flip the axes, but much more importantly, we'll recalculate the line of best fit correctly:

A copy with saner axes

Notice the difference. According to the original, incompetently-calculated line of best fit, going from 90% white to 60% white is associated with a massive increase in Gini coefficient: from 0.4 to 0.5 or thereabouts. As soon as we actually bother to calculate the right bloody line, the trend becomes much tamer: going from 90% white to 60% white is associated with an increase in the Gini coefficient only from about 0.44 to about 0.46. Even if this turns out to be accurate (see below), it's a small contribution, dwarfed by other unknown factors and all but lost in the noise of the general variation between states.
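Those two sets of figures also allow a rough back-of-the-envelope check. For least-squares lines, the slope of y fitted on x multiplied by the slope of x fitted on y equals the square of the correlation coefficient. The sketch below applies that identity to the approximate readings quoted above; since those readings are eyeballed from graphs, the result is only indicative, and the variable names are my own.

```python
import math

# Approximate readings quoted in the text (eyeballed, so only indicative).
white_hi, white_lo = 90.0, 60.0   # percent white
gini_original = (0.40, 0.50)      # original line: Gini at 90% and at 60% white
gini_corrected = (0.44, 0.46)     # recalculated line: same two positions

# Original line: percent white fitted on Gini
# (percentage points of white per unit of Gini).
slope_white_on_gini = (white_lo - white_hi) / (gini_original[1] - gini_original[0])

# Recalculated line: Gini fitted on percent white
# (units of Gini per percentage point of white).
slope_gini_on_white = (gini_corrected[1] - gini_corrected[0]) / (white_lo - white_hi)

# For least-squares lines, the product of the two slopes equals r squared.
r_squared = slope_white_on_gini * slope_gini_on_white
r = -math.sqrt(r_squared)  # negative: Gini rises as percent white falls

print(f"implied r^2 ~ {r_squared:.2f}, r ~ {r:.2f}")
```

On these rough numbers, something like a fifth of the variation in Gini coefficient between states is associated with the percentage of white residents, which is consistent with the trend being mostly lost in the noise.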

Why's there a difference?

The untrained reader may not yet appreciate the difference, and so not realise quite what a silly mistake this is.

Suppose I poll six people on how many cups of coffee they drink and how many cakes they eat per day, and get the following results:

Coffee and cakes

Suppose we think that coffee makes people hungry, and ask what the relationship is, and decide to guess by drawing a line of best fit on these results. What is that line? Well, there are two people who drink three cups of coffee a day, and a good fit passes as close as possible to halfway between them. Similarly, there are two people who drink four cups a day, and ideally we want to pass halfway between them too. Lastly, there are two people who drink five cups a day, and we would like to pass halfway between them too. That gives us three points, and luckily there is a line passing through all three, which is hence the line of best fit.

Suppose on the other hand we think that cake makes people thirsty, and ask what the relationship is there instead, and we decide to draw a line of best fit accordingly. Similarly, we have three pairs of points: the people who eat one cake a day, the people who eat two cakes a day, and the people who eat three cakes a day. There is a line passing exactly halfway between each pair again, and that's the line of best fit.

Both are illustrated here:

If coffee causes cake consumption

If cakes cause coffee consumption

Note that in the second graph the line of best fit is four times steeper than in the first graph, despite being based on the same data. In other words, if you think that cake eating causes coffee consumption, you expect an increase of two cups of coffee to be associated, on average, with an increase of four cakes. If, however, you think that coffee consumption causes cake eating, you expect an increase of two cups of coffee to be associated, on average, with an increase of only one cake.
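The halfway-between-each-pair reasoning can be checked mechanically. The six survey results in the sketch below are hypothetical: they are one set of values consistent with the description above (two people at each of three, four, and five cups; two people at each of one, two, and three cakes; a fourfold difference in steepness), not necessarily the ones in the table. The helper names are likewise made up for illustration.

```python
from collections import defaultdict

# HYPOTHETICAL survey results: (cups of coffee, cakes) per person per day.
people = [(3, 1), (3, 2), (4, 1), (4, 3), (5, 2), (5, 3)]


def midpoints(pairs):
    """Group pairs by their first value and average the second value."""
    groups = defaultdict(list)
    for cause, effect in pairs:
        groups[cause].append(effect)
    return sorted((c, sum(v) / len(v)) for c, v in groups.items())


def slope_through(points):
    """Slope of the line through the first and last midpoint.

    This is only the line of best fit because the midpoints here are
    collinear and every group has the same size.
    """
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


# Coffee as the cause: average cakes at each level of coffee drinking.
cakes_given_coffee = midpoints(people)
# Cake as the cause: average coffee at each level of cake eating.
coffee_given_cakes = midpoints([(cakes, cups) for cups, cakes in people])

cakes_per_cup_if_coffee_causes = slope_through(cakes_given_coffee)
cups_per_cake_if_cake_causes = slope_through(coffee_given_cakes)
# Express the second line on the same axes (cakes per cup) by inverting.
cakes_per_cup_if_cake_causes = 1 / cups_per_cake_if_cake_causes

print(cakes_given_coffee)
print(coffee_given_cakes)
print(cakes_per_cup_if_coffee_causes, cakes_per_cup_if_cake_causes)
```

With these values, the first line rises by half a cake per cup and the second by two cakes per cup: the same six people, and a fourfold difference in slope.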

The numbers are much more complicated in West's favourite graph, but the basic error, and the inaccurate results it produces, are exactly the same.

Conclusions

Clearly, this graph, which West considers important enough to occasion an article, has been produced incompetently, and does not support the alarming conclusions that one is meant to draw from it.

Such a graph is only very weak evidence of causation: just because two things are related, we cannot conclude that one caused the other. It could very well be that something else sometimes causes both: for example, minimum-wage employers might be expected both to encourage settlement of economic migrants, and to increase inequality. Hence, for all we know, the correlation could be caused in large part by economic conditions which encourage a large amount of minimum-wage labour.
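A toy simulation makes the point concrete. Everything in the sketch below is synthetic: the "states", the variable names, and the strengths of the effects are invented for illustration and say nothing about real data. A single hidden factor drives both quantities, which never influence each other directly, and a healthy correlation appears all the same.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Entirely synthetic data. `low_wage` stands for the prevalence of
# minimum-wage work; it feeds into both other quantities, and neither
# of those is computed from the other.
n = 2000
low_wage = [random.gauss(0, 1) for _ in range(n)]
migration = [z + random.gauss(0, 1) for z in low_wage]
inequality = [z + random.gauss(0, 1) for z in low_wage]

r = pearson(migration, inequality)
print(f"correlation with no direct causal link: {r:.2f}")
```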

There are also sorts of causation which, while uncontroversial, don't support the conclusions which West seeks to draw. For example, economic migrants usually arrive poor. Even in a highly equal society, with an aggressively redistributive economic policy (which the US isn't) they will take some time to catch up. That would be evidence that the Gini coefficient doesn't measure inequality properly, rather than evidence of inequality.

Somewhat more pressingly, it's clear that West isn't even actually addressing the things he thinks he is. A large number of the people he is managing to count as "immigrants" (more than an eighth of the US population) are African-Americans, many of whose families have been in the USA for centuries. The fact that they are poor speaks ill of many aspects of American society and economic policy, but says nothing at all about immigration.

As such, it is hard to read this graph as providing evidence that immigration causes inequality, and somewhat easier to read it as supporting things that people on the Left have been saying forever: that, to varying degrees in varying places, ethnic minorities are being oppressed.

So this graph is very weak evidence of many things, but West's claims about it are rather strong evidence that he has not the faintest clue what he's talking about. It goes without saying, given that they published it, that no editor at the Telegraph has much of a clue either: this is the newspaper that currently employs the professional liar James Delingpole, and a rogues' gallery of other nutcases.

So, in summation, I can understand why my colleagues on the Left might not want to see it: most of them don't wish to waste their time on nonsense.