June 2, 2010

Lost in log-space: residuals of log-transformed data

Yesterday I was helping Jim with a model for predicting non-profit registration by county, and a little problem came up about how to explain some data. He had found a really strong relation, ran the model, and made a pretty choropleth of the results. But the legend had entries like "-0.500 standard deviation to 0.500 standard deviation". And the underlying data had been log transformed. He didn't have time to rebuild all of his maps, so I was asked this question: "What do I tell them this means?"

Good question! Log transformation of data is a common technique for dealing with several problems: typically anything where the scatterplot of a relationship looks log-normal, "looks like an exponential curve", has "too long" of a rightward tail, or numerous other things. I will leave aside the question of whether the log transform is the right one in a particular instance, and proceed directly to today's (well, yesterday's) question:
When analyzing the residual deviations after the model is fit, what does the deviation of the log of the variable of interest mean? And make it friendly, statistics guy!
Well... ouch. I don't like the implication of "y'all are hard to understand". But then again, a lot of people feel that way, and that's why analysts are paid to be analysts. So what to do? What do "regular people" want to see, that is also a good representation of the reality?
Percent deviation from predicted.
And here's what they don't want to hear (but you want to do):
  • If you have "standardized residuals" (how very statistical of you!), first multiply by the RMSE to get actual deviations.
  • Exponentiate the deviations (e to the power of the deviation, if you used the natural log), and respect the sign of the deviation!
  • Now you have the ratio of measured over predicted. If it's greater than 1, then subtract one and multiply by 100 to get the percent "high". If it's less than 1, then subtract the ratio from one and multiply by 100 to get the percent "low".
Use those numbers for your labels, don't explain the funky transformation details to a lay audience, and enjoy the lack of flustered puzzlement on the audience's faces!
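For the code-inclined, here is a minimal sketch of that recipe in Python. The function name `percent_deviation` is made up for illustration, and it assumes the model was fit on the natural log of the variable (pass a different `base` if you used log10 or similar).

```python
import math

def percent_deviation(std_residual, rmse, base=math.e):
    """Turn a standardized residual from a log-scale model into (percent, direction).

    Assumption: the response was log-transformed with the given base
    (natural log by default). The name and signature are illustrative only.
    """
    # Step 1: undo the standardization to get the deviation in log units.
    log_dev = std_residual * rmse
    # Step 2: exponentiate, keeping the sign, to get measured / predicted.
    ratio = base ** log_dev
    # Step 3: express the ratio as a percent high or a percent low.
    if ratio >= 1:
        return (ratio - 1) * 100, "high"
    return (1 - ratio) * 100, "low"
```

Calling `percent_deviation(1, 0.5)` gives the percent "high" for one standard deviation above predicted when the RMSE is 0.5, and `percent_deviation(-1, 0.5)` gives the matching percent "low".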

Example, just for practice:
Say your RMSE is 0.5. Then one deviation high (positive) is e to the 0.5: about 1.65; and one deviation low is e to the -0.5: about 0.61.

So one up is about 65% high, and one down is about 39% low.
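The same arithmetic can relabel a map legend directly. This is only a sketch: `legend_label` is a hypothetical helper, it assumes a natural-log transform, and the RMSE of 0.5 is the practice value from this example, not a number from Jim's model.

```python
import math

def legend_label(lo_sd, hi_sd, rmse):
    """Relabel a legend band given in standard deviations as percent high/low.

    Assumptions: natural-log transform, residuals standardized by the RMSE.
    """
    def describe(sd):
        # Standardized residual -> log deviation -> measured / predicted ratio.
        ratio = math.exp(sd * rmse)
        if ratio >= 1:
            return f"{(ratio - 1) * 100:.0f}% high"
        return f"{(1 - ratio) * 100:.0f}% low"

    return f"{describe(lo_sd)} to {describe(hi_sd)}"

# The "-0.500 standard deviation to 0.500 standard deviation" band, with RMSE 0.5:
print(legend_label(-0.5, 0.5, rmse=0.5))  # prints "22% low to 28% high"
```

Note that the band is not symmetric once it is in percent terms, even though it was symmetric in standard deviations.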

One additional note here: symmetrical percent bands, like 10% to 40% high or low, will be misleading because the "high" band is actually expected to have smaller counts than the "low" band if a log model is correct. That is because, in log-space, 10% to 40% low covers a wider interval than 10% to 40% high. But this is the way that people have become accustomed to having data presented, and people think of it as "fair", despite this potential bias.
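To see how lopsided those bands are, here is a rough sketch. It assumes the log-scale residuals are normal with mean zero and a standard deviation equal to the RMSE (0.5 here, purely as an illustration); `band_share` is a made-up helper, not anything from a stats package.

```python
import math

def normal_cdf(x, sigma):
    """CDF of a mean-zero normal with standard deviation sigma."""
    return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))

def band_share(ratio_lo, ratio_hi, sigma):
    """Expected share of cases with measured/predicted between the two ratios.

    Assumption: natural-log residuals are normal with mean 0 and sd sigma.
    """
    return normal_cdf(math.log(ratio_hi), sigma) - normal_cdf(math.log(ratio_lo), sigma)

sigma = 0.5  # assumed RMSE on the log scale, for illustration only

high = band_share(1.10, 1.40, sigma)  # 10% to 40% high
low = band_share(0.60, 0.90, sigma)   # 10% to 40% low

# Width of each band in log units: the "low" band is the wider one.
high_width = math.log(1.40) - math.log(1.10)
low_width = math.log(0.90) - math.log(0.60)

print(f"high band: log width {high_width:.3f}, expected share {high:.1%}")
print(f"low band:  log width {low_width:.3f}, expected share {low:.1%}")
```

Under these assumptions the "low" band catches noticeably more cases than the "high" band, even though both are labeled "10% to 40%".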
