Australian newspaper reports of polls are not always accompanied by statements summarizing methodological details such as sampling, weighting, sample size, confidence intervals and so on (though the reports of Newspoll in the Oz and the Nielsen polls in the SMH are reliably good on this score). For instance, in Tuesday’s Australian the table with the Newspoll results said

> This survey was conducted on the telephone by trained interviewers in all states of Australia and in both city and country areas. Telephone numbers and the person within the household were selected at random. The data has been weighted to reflect the population distribution. The latest survey is based on 1169 interviews among electors. The maximum margin of sample error is plus or minus 3 percentage points. Copyright at all times remains with Newspoll. More poll information is available at www.newspoll.com.au

I’m going to make a few blog posts about this margin of error stuff over the next couple of days. I’ll start with one observation which comes out of some correspondence with a student at UCLA who was asking about models for election tracking:

> Part of the estimation accounts for sampling noise, … using the standard [tex]p(1-p)/n[/tex]. But shouldn’t the sampling variance calculations be inflated … [due to the use of] post-sampling stratification weights…. Do you know how big of an inflation?

Yes, if you use post-stratification weights then the nominal calculation of the poll’s standard error (and its “margin of error”, usually the 95% confidence interval, or plus or minus 1.96 standard errors) needs to be inflated. The key term in the inflation is the variance of the weights. Of course, this is almost never reported. We usually get the **nominal** sample size and a statement that the data have been weighted, but the reported “margin of error” is almost always based on the nominal sample size. For instance, in that latest Newspoll, with a nominal [tex]n[/tex] of 1,169, suppose one of the estimated sample proportions is [tex]\hat{p} = .5[/tex]; then the usual formula gives us

[tex]se(\hat{p}) = \sqrt{var(\hat{p})} = \sqrt{\displaystyle\frac{.5\times.5}{1,169}} = .0146[/tex]

and so plus or minus 1.96 times that gives us the “maximum (?) margin of error” of about 2.86 percentage points (or let’s round it and call it 3 percentage points).
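For the skeptical reader, the arithmetic is easily checked in a few lines of Python (my quick sketch, not anything Newspoll supplies):

```python
import math

n = 1169        # Newspoll's nominal sample size
p_hat = 0.5     # the worst-case sample proportion

se = math.sqrt(p_hat * (1 - p_hat) / n)   # nominal standard error
moe = 1.96 * se                           # nominal 95% "margin of error"

print(round(se, 4))    # ~0.0146
print(round(moe, 3))   # ~0.029, i.e., about 3 percentage points
```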

What do the (post-stratification) weights do? Well, it depends on how variable they are across the sample. In the limiting case of weights that don’t vary at all (every case has the same weight), we’re back to the nominal computations given above. More likely, the weights vary considerably across the data set: some categories of people are easy to contact (e.g., older, retired, more likely to be at home when the phone rings), and some are more difficult (e.g., younger, urban, English not a first language, shift-workers, babies playing up when the phone rings, etc.). I don’t have any experience with commercial phone polling in Australia, but if what I know about telephone polling in the United States has any relevance at all (it has at least some), then you wind up with weights that can easily range from .25 to 5 (i.e., yes, some cases are being given 20 times the weight of other cases, reflecting the relative ease or difficulty of contacting those people relative to their prevalence in the target population; believe me, I’ve seen a lot worse).
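To put a number on “how variable”, a standard rule of thumb (this is Kish’s design effect for weighting, my addition here, not something the polling report gives you) says that weighting inflates the sampling variance by approximately

[tex]\mbox{deff} = \displaystyle\frac{n\sum_i w_i^2}{\left(\sum_i w_i\right)^2} = 1 + \mbox{cv}^2(w)[/tex]

where [tex]\mbox{cv}(w)[/tex] is the coefficient of variation of the weights (their standard deviation divided by their mean); the standard error is then inflated by [tex]\sqrt{\mbox{deff}}[/tex]. Equal weights give [tex]\mbox{deff}=1[/tex] and no inflation; the more variable the weights, the bigger the blow-out.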

Now suppose that the weights go something like this: .25, .5, .75, 1, 2, 5, with respective shares of .1, .2, .2, .2, .2, .1. That is, 10% of the cases get the very low weight of .25, while 10% of the cases get the high weight of 5. Then the resulting standard error blows out to .0208, about 142% of the standard error produced by taking the nominal sample size at face value (I used the very slick survey package in R to do this; code appears below). The resulting 95% confidence interval is also 142% of the confidence interval we get from (naively, incorrectly) doing the calculations assuming no weighting: taking the weights into consideration, the 95% CI is plus or minus 4.08 percentage points, spanning, say, 8 percentage points, instead of the 6 percentage points spanned by the naive 95% CI.
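I used R’s survey package for the figures just quoted; a back-of-envelope version of the same calculation is the Kish design-effect approximation, [tex]\mbox{deff} = E[w^2]/E[w]^2[/tex]. Here is a sketch of that approximation in Python (mine, for illustration; it lands in the same ballpark as, but not exactly on, the survey package’s numbers):

```python
import math

# the hypothetical weight distribution from the text
weights = [0.25, 0.5, 0.75, 1.0, 2.0, 5.0]
shares  = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]   # proportion of cases at each weight

n = 1169
p_hat = 0.5

mean_w  = sum(s * w for s, w in zip(shares, weights))       # E[w]
mean_w2 = sum(s * w ** 2 for s, w in zip(shares, weights))  # E[w^2]
deff = mean_w2 / mean_w ** 2   # Kish design effect, ~1.94

se_nominal  = math.sqrt(p_hat * (1 - p_hat) / n)  # ~0.0146
se_weighted = se_nominal * math.sqrt(deff)        # ~0.0204

print(round(deff, 2), round(se_weighted, 4))
```

This approximation gives an inflation of about 39% (SE ≈ .0204) versus the 42% (.0208) from the survey package’s full variance estimator; the exact figure depends on how the outcome covaries with the weights in the data, which the quick approximation ignores.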

So, with weights, there can be considerably more uncertainty than the usual, naive, textbook calculations would suggest. You’ve got to know something about the weights in order to compute the correct margins of error, and usually we don’t get enough information from the polling organization to be able to do that. In the United States the big polling organizations like CBS News will supply this kind of thing upon request, or when they deposit their data with archives (e.g., the Roper Archive at the University of Connecticut). I haven’t asked any of the big Australian organizations for the raw, individual-level data (with weights) so we could look at this, but I/we should. Indeed, it is possible that phone polling in Australia is doing so well in terms of response rates across the population that weighting is a minor aspect of the process, and the nominal standard error and margin of error are very close to the correct quantities after we take the (minor) weighting into account; I just don’t know, but I doubt it very much.

In the meantime, I wonder if we could ask the polling organizations to report the “margin of error” based on the appropriate calculation given their post-stratification weights? It may well be in their interests to do so, since the resulting inflation in the margin of error buys a little insurance against bias. Put simply, if 95% of my results are really the truth plus or minus 4 points, then I don’t want to say “95% of my results are ‘the truth’ plus or minus 3 points”, because eventually I’ll be caught out.
