Tuesday July 31, 2007

Filed under: general — jackman @ 9:21 am

Ingmar Bergman yesterday, Coach Walsh and Tom Snyder today. And Chief Justice Roberts has a seizure. Enough is enough.

You’d see Coach Walsh down at the Stanford Sports Cafe. The last time I saw him was over lunch down there maybe 6 weeks ago; he was having a laugh with Steve Young and the current coach, Jim Harbaugh. He seemed in great health and great spirits. I didn’t know he had been so sick until a colleague filled me in. And leukemia also got Tom Snyder.

Comments Off on farewells

wedging, a definition

Monday July 30, 2007

Filed under: Australian Politics — jackman @ 5:31 pm

Crikey today gave us the dictionary definition of wedging. Here’s the political science version, from an older 1996/97 article of mine on the Hanson phenomenon in Australian politics: “Pauline Hanson, the Mainstream, and Political Elites: the place of race in Australian political ideology”, Australian Journal of Political Science, 1998, 33:167-186. Figure 1 from that article appears below (click on the thumbnail). A long quote gives the wordy version, with some cites to the literature:

The top panel of Figure 1 shows a polity dominated by two major parties, each encompassing a wide array of opinions on racial issues, cutting across the dominant “left-right” ideological dimension along which most political issues are contested. Opening up ideological debate to a cross-cutting dimension like race (lower panel of Figure 1) sees elites risk dividing their own base of supporters (e.g., McAllister 1993, 162); partisans who are in reasonable agreement on the dominant “left-right” ideological dimension could easily find themselves in disagreement on questions to do with issues related to race (not an unreasonable assumption in the Australian context, as we shall see), creating opportunities for political rivals to form new party groupings. A rational strategy for party elites faced with this possibility is to downplay the salience of the cross-cutting ideological dimension, keeping political conflict oriented on the established “left-right” dimension (e.g., Riker 1984). But if the parties differ in the extent to which they face de-stabilization from a cross-cutting ideological dimension, then elites of the more cohesive party might well be tempted to heighten the salience of the cross-cutting racial dimension, in an attempt to divide-and-conquer one’s political opponents, with the benefits of luring partisans from one’s opponents more than offsetting any losses arising from the de-stabilization among the ranks of one’s own partisans (lower panel of Figure 1).

This hypothetical example makes clear that there is a lot at stake in Australia’s so-called “race debate”. Aside from the moral questions race-related issues pose, the divisions caused by race can have dramatic political consequences as well. Cross-cutting issues like race can act as an ideological wedge, prying apart established party groupings, or giving rise to new political parties (e.g., Hanson’s One Nation party), and perhaps even a realignment of the parties and a redefinition of the ideological terrain over which they compete.


Comments (8)

trip report (photos), SYD-BNE-SYD

Saturday July 28, 2007

Filed under: flight nerdery — jackman @ 11:18 am

07.24.2007 YSSY-YBBN DJ221

07.26.2007 YBBN-YSSY DJ242
YBBN 01 … MANFA1 STAR (I’m guessing) – YSSY 34L

Comments Off on trip report (photos), SYD-BNE-SYD

large graphics files from R, a workaround

Friday July 20, 2007

Filed under: computing,statistics — jackman @ 4:44 pm

I use MCMC a lot, and do a lot of expository talks on the subject. When making graphs of MCMC output, so as to demonstrate, say, slow mixing, I sometimes wind up trying to plot massive amounts of data, with PDF files in the tens to hundreds of megabytes; PDF is the format I go with because I usually want to pull these into pdflatex (although I suppose I could use another graphics format, but see below), and all my PDF viewers take forever to render massive PDF files.

Of course I usually “thin” (keep every [tex]n[/tex]-th iteration, e.g., [tex]n[/tex] = 10, 25, 50) when working with long streams of MCMC output, but sometimes you want the reader to see what slow mixing really looks like, in all its foulness. Today I came up with the following workaround:

  1. open the massive, hi-res PDF file in Preview (I’m on a Mac)
  2. save as JPG (highest quality)
  3. save the just-created JPG as PDF, call it <xxx>LoRes.pdf or something like that.

This works ok for me, and produces tolerable looking output, much, much smaller than the original PDF (e.g., 269KB vs 33MB, or 126 times smaller; thumbnail appears below). Creating JPG or PNG from R goes close, but always looks not so great, at least to my eye.
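For reference, the thinning mentioned above (keep every n-th iteration) is a one-liner in most languages. What follows is a minimal Python sketch rather than the R code I actually use; the function name `thin` and the toy chain are made up for illustration.

```python
def thin(draws, n):
    """Keep every n-th draw (the n-th, 2n-th, ...) from a sequence of MCMC output."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return draws[n - 1::n]


# Toy "chain": iteration numbers 1..1000 standing in for real MCMC draws.
chain = list(range(1, 1001))
thinned = thin(chain, 25)
print(len(thinned), thinned[:3])  # 40 [25, 50, 75]
```

Thinning by 25 cuts the number of plotted points by a factor of 25, which is exactly why it hides what slow mixing looks like at full resolution.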


Comments (3)

absolute path names are stupid, amateur

Thursday July 19, 2007

Filed under: computing,statistics — jackman @ 11:29 am

Gripe of the day: when you get shipped a set of files to create data or replicate an analysis, and the paths to data files etc contain absolute paths. For instance, from the American National Election Studies, some Stata code to create the 2000 data sets.

infix using c:\anes\anes_2000prepost\20051006\anes_2000prepost_col.dct

do c:\anes\anes_2000prepost\20051006\anes_2000prepost_lab.do

do c:\anes\anes_2000prepost\20051006\anes_2000prepost_fmt.do

do c:\anes\anes_2000prepost\20051006\anes_2000prepost_cod.do

*do c:\anes\anes_2000prepost\20051006\anes_2000prepost_md.do

save c:\anes\anes_2000prepost\20051006\anes_2000prepost.dta

because of course everyone uses a Windows machine…with that directory structure and names, right?
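The usual fix is to set the project root in exactly one place and build every other path relative to it. Here is a minimal sketch of that idea in Python (not Stata, and not anything ANES ships); the `ANES_ROOT` environment variable and the `project_path` helper are hypothetical names used only for illustration.

```python
import os
from pathlib import Path


def project_path(*parts, root=None):
    """Build a path relative to a single, user-settable project root.

    The root comes from the `root` argument, else the (hypothetical)
    ANES_ROOT environment variable, else the current working directory.
    """
    base = Path(root or os.environ.get("ANES_ROOT", "."))
    return base.joinpath(*parts)


# Every file is named relative to the root; no drive letters, no fixed folders.
dictionary_file = project_path("anes_2000prepost_col.dct")
output_file = project_path("anes_2000prepost.dta")
print(dictionary_file.name, output_file.name)
```

Whoever receives the files then changes one setting (or just runs from the unpacked directory) instead of editing every line.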

Comments (3)

prediction with pscl functions, zeroinfl and hurdle

Monday July 16, 2007

Filed under: statistics — jackman @ 1:15 am

Matthew Browne from Australia’s CSIRO writes:

I had a query about estimating with pscl, and Bill Venables suggested I send you a note. It seems that the functions zeroinfl() and hurdle() do not include an option for outputting standard errors of the predictions in the predict() method.

I was wondering if you could give me any advice on how to go about generating these?

Good question, and Achim Zeileis and I are working on a solution. The easiest thing to do is a parametric bootstrap, which is pretty easy to code up. We’ve given Matthew a code fragment that is really not much more than proof-of-concept, and we’ll work up something for real for the next iteration of pscl.
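To give a flavour of the idea (this is not the fragment we sent Matthew, and it is not pscl code), here is a minimal parametric bootstrap sketch in Python. It assumes the simplest possible count model, an intercept-only Poisson, where the fitted value is just the sample mean; the function names and the toy data are invented for illustration. The recipe is the same for zeroinfl() or hurdle() fits: simulate new responses from the fitted model, refit, predict, and take the standard deviation of the predictions across replicates.

```python
import math
import random
import statistics


def rpois(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def bootstrap_se(y, n_boot=2000, seed=1):
    """Parametric bootstrap SE of the predicted mean count, intercept-only Poisson."""
    rng = random.Random(seed)
    lam_hat = statistics.fmean(y)  # MLE of the Poisson mean
    preds = []
    for _ in range(n_boot):
        y_star = [rpois(lam_hat, rng) for _ in y]  # simulate from the fitted model
        preds.append(statistics.fmean(y_star))     # "refit", then predict
    return lam_hat, statistics.stdev(preds)


# Made-up counts, purely for illustration.
counts = [0, 2, 1, 0, 3, 1, 0, 0, 4, 2, 1, 0, 1, 2, 0, 1, 3, 0, 1, 2]
lam_hat, se = bootstrap_se(counts)
print(lam_hat, se, math.sqrt(lam_hat / len(counts)))
```

In this toy case there is an analytic answer to check against, sqrt(lambda/n), and the bootstrap standard error should land close to it. With covariates and a zero-inflation or hurdle component, the refit step becomes a real model fit, which is where the computing time goes.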

Comments Off on prediction with pscl functions, zeroinfl and hurdle

Housing prices (George Megalogenis in The Australian)

Sunday July 15, 2007

Filed under: Australian Politics — jackman @ 7:21 am

Nice work by George Megalogenis in The Australian. He (and the Oz) commissioned some housing price data from an outfit called Australian Property Monitors, looking at changes in housing prices by electorate. George and the Oz have presented the electorates recording the top 20 price falls, and the top 20 price increases. The interesting thing is that among the 20 price falls, only four are designated as “marginal seats”. And of those, only three are notionally or actually Liberal-held seats, Lindsay, Dobell and Parramatta (notionally Lib after the 2006 redistribution), and the 2004-2007 housing price falls range from 11.7% in Banks (ALP) to 2.9% in Dobell (Liberal).

Among the divisions recording the 20 highest price increases, 15 are in Western Australia (!), 4 are in Queensland (but not in Brisbane), and the other is in the Northern Territory; 6 divisions are designated as marginal, but only two are held by the Liberals (Stirling and Hasluck in WA, where housing prices are up 100% and 114.7% since 2004, respectively); I say “but” since the idea is that, with housing prices doing so well there, perhaps these seats will buck what looks like a substantial trend against the government. Continuing this line, is it possible that voters in places like Capricornia (114% price gains, ALP-held seat on a 4.4% margin) will also buck the trend? I also note that WA has made for some interesting election nights, posting surprisingly strong results for the Coalition around the time the picture in the Eastern states has become clear. In short, I don’t see the housing market story as all one-way traffic for either side, but this research certainly does offer a rationalization for the pattern of poll results reported on last week, where NSW seems to be the most pro-ALP state at the moment.

The Oz has been relatively “research-forward” of late; they also commissioned some aggregations of census data from ABS that they published two Mondays ago. It’s heartening to see the media actually rounding up data like this: even better would be if they would push it up on their site as a spreadsheet for the rest of us to download and play with (while the Oz can commission data and analysis from ABS and commercial data providers, academics can’t afford it).

Megalogenis foreshadowed this stuff in his appearance on Insiders on Sunday, with some references to housing prices in Sydney’s west, a “two-speed housing market” etc. That made me curious as to where he was getting his info; now we know.

In a piece I wrote for Mortgage Nation, the Sims and Warhurst edited collection of essays on the 2004 election, I created a mortgage exposure (“stress”) variable for each electorate by taking [tex]p_i[/tex], the proportion of dwellings in division [tex]i[/tex] classified as “being purchased” in the 2001 Census, [tex]m_i[/tex] the median monthly housing loan repayment in division [tex]i[/tex], and [tex]y_i[/tex], median family weekly income, and forming the indicator [tex]z_i = (p_i m_i)/y_i[/tex], and then normalizing [tex]z_i[/tex] to range from zero to one.

The result is a variable that has a mean of .42, with the 5-th percentile at .11 and a 95-th percentile at .81. The minimum value of this variable, zero, is recorded in Wentworth (NSW), a seat that is both reasonably wealthy and has a high proportion of fully owned and rented dwellings, while the maximum value of one is recorded in Holt (VIC), an outer-metropolitan division in Melbourne with 48.4% of dwellings classified as ‘‘being purchased’’ (the highest such percentage in the country).
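For concreteness, the construction of the index looks like this. It is a minimal Python sketch, not the code behind the chapter, and the three input rows are made-up numbers rather than census figures; the function name `mortgage_exposure` is likewise just a label for illustration.

```python
def mortgage_exposure(p, m, y):
    """Form z_i = (p_i * m_i) / y_i for each division, then rescale to [0, 1]."""
    z = [(p_i * m_i) / y_i for p_i, m_i, y_i in zip(p, m, y)]
    lo, hi = min(z), max(z)
    if hi == lo:
        raise ValueError("all divisions have the same z; cannot rescale")
    return [(z_i - lo) / (hi - lo) for z_i in z]


# Hypothetical divisions (NOT real census data):
p = [0.15, 0.30, 0.45]   # proportion of dwellings "being purchased"
m = [1800, 1000, 1100]   # median monthly housing loan repayment
y = [2000, 1000, 900]    # median weekly family income
print(mortgage_exposure(p, m, y))
```

By construction the least exposed division scores zero and the most exposed scores one, so the index is only meaningful relative to the set of divisions it was computed over.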

This variable held its own in a multiple regression analysis of swing, division by division, recorded at the 2004 election (coefficient of 3, standard error of 1.31, in the presence of numerous other predictors; dependent variable was two-candidate swing to the Coalition, so the implication is that as mortgage exposure goes up, net of other predictors, then so too did Coalition support). The full table of regression results appears below; lots of other controls, including change in Coalition candidate ballot position, incumbency status, etc.

I know this is aggregate-level analysis, and I’m mindful of the perils of cross-level inference, but I’m actually reasonably fond of this model of analysis: the data are reasonably good (Australia has a census every five years), there are 150 divisions in the country, there usually aren’t wild amounts of heterogeneity within each division, and it’s of great political relevance (since governments are elected division-by-division, not from the national-level 2PP number or preferred PM or whatever). It would be interesting to re-do this analysis either in predictive fashion this time around, or (a little more safely!) post-election, with actual housing price data.


Comments (1)

what is the right “margin of error”? (Part One, weighting)

Thursday July 12, 2007

Filed under: Australian Politics,statistics — jackman @ 5:17 pm

Australian newspaper reports of polls are not always accompanied by statements summarizing details as to sampling, weighting, sample size, confidence intervals etc (but the reports of Newspoll in the Oz and the Nielsen polls in the SMH are actually reliably good on this score). For instance, in Tuesday’s Australian the table with the Newspoll results said

This survey was conducted on the telephone by trained interviewers in all states of Australia and in both city and country areas. Telephone numbers and the person within the household were selected at random. The data has been weighted to reflect the population distribution. The latest survey is based on 1169 interviews among electors. The maximum margin of sample error is plus or minus 3 percentage points. Copyright at all times remains with Newspoll. More poll information is available at www.newspoll.com.au

I’m going to make a few blog posts about this margin of error stuff over the next couple of days. I’ll start with one observation which comes out of some correspondence with a student at UCLA who was asking about models for election tracking:

Part of the estimation accounts for sampling noise, … using the standard [tex]p(1-p)/n[/tex]. But shouldn’t the sampling variance calculations be inflated … [due to the use of] post-sampling stratification weights…. Do you know how big of an inflation?

Yes, if you use post-stratification weights then the nominal calculation of the poll’s standard error (and its “margin of error”, usually the 95% confidence interval, or plus/minus 1.96 standard errors) needs to be inflated. The key term in the inflation is the variance of the weights. Now of course this is almost never reported. We usually get the nominal sample size, a statement that the data have been weighted, but almost always the report of the “margin of error” is based on the nominal sample size. For instance, in that latest Newspoll, with a nominal [tex]n[/tex] of 1,169, suppose one of the sample proportions we estimated was [tex]\hat{p} = .5[/tex], then the usual formula gives us

[tex]se(\hat{p}) = \sqrt{var(\hat{p})} = \sqrt{\displaystyle\frac{.5\times.5}{1,169}} = .0146[/tex]

and so plus or minus 1.96 times that gives us the “maximum (?) margin of error” of about 2.86 percentage points (or let’s round it and call it 3 percentage points).
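The nominal calculation is easy to check. Here is a minimal Python sketch of it (the function name `nominal_moe` is mine, for illustration only). Carried at full precision the margin comes out nearer 2.87 points; the 2.86 above comes from multiplying the rounded .0146 by 1.96. Either way it rounds to 3.

```python
import math


def nominal_moe(p_hat, n, z=1.96):
    """Textbook standard error and margin of error for a sample proportion,
    taking the nominal sample size at face value (no weighting)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return se, z * se


se, moe = nominal_moe(0.5, 1169)
print(round(se, 4), round(100 * moe, 2))
```

The “maximum” in the newspaper's wording refers to the fact that p(1-p) is largest at p = .5, so any other sample proportion from the same poll has a smaller nominal margin.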

What do the (post-stratification) weights do? Well it depends on how variable they are across the sample. In the limiting case of weights that don’t vary at all (every case has the same weight), we’re back to the nominal computations given above. More likely is that the weights vary considerably across the data set: some categories of people are easy to contact (e.g., older, retired, more likely to be at home when the phone rings), some categories of people are more difficult (e.g., younger, urban, English not a first language, shift-worker, babies playing up when the phone rings, etc). I don’t have any experience with commercial phone polling in Australia, but if what I know about telephone polling in the United States has any relevance at all (it has at least some), then you wind up with weights that can vary from at least .25 to 5 (i.e., yes, some cases are being given 20 times the weight of other cases, reflecting the relative ease/difficulty of contacting those people relative to their prevalence in the target population; believe me, I’ve seen a lot worse).

Now suppose that the weights go something like this: .25, .5, .75, 1, 2, 5, with respective shares of .1, .2, .2, .2, .2, .1. That is, 10% of the cases get the very low weight of .25, while 10% of the cases get the high weight of 5. Then the resulting standard error blows out to .0208, about 142% of the standard error produced by taking the nominal sample size at face value (I used the very slick survey package in R to do this; code appears below). The resulting 95% confidence interval is also 142% of the confidence interval we get from (naively, incorrectly) doing the calculations assuming no weighting: taking the weights into consideration, the 95% CI is plus or minus 4.08 percentage points, spanning, say, 8 percentage points, instead of the 6 percentage points spanned by the naive 95% CI.
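A quick way to get a feel for the size of the inflation without the survey package is Kish's approximate design effect for unequal weights, deff = mean(w²) / mean(w)². The Python sketch below is a back-of-the-envelope check under that approximation, not the R code behind the numbers above, and the function name `kish_deff` is mine. For these weights it gives an inflation of the standard error of roughly 1.39, in the same neighbourhood as the 1.42 reported above; the two need not agree exactly, since Kish's formula is only an approximation.

```python
import math


def kish_deff(weights, shares):
    """Kish's approximate design effect from unequal weighting:
    deff = E[w^2] / (E[w])^2, where `shares` gives the fraction of the
    sample carrying each weight."""
    mean_w = sum(w * s for w, s in zip(weights, shares))
    mean_w2 = sum(w * w * s for w, s in zip(weights, shares))
    return mean_w2 / mean_w ** 2


weights = [0.25, 0.5, 0.75, 1, 2, 5]
shares = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]

n = 1169
deff = kish_deff(weights, shares)
se_nominal = math.sqrt(0.5 * 0.5 / n)
se_weighted = se_nominal * math.sqrt(deff)   # SE inflates by sqrt(deff)
n_eff = n / deff                             # "effective" sample size

print(round(deff, 2), round(se_weighted, 4), round(n_eff))  # 1.94 0.0204 602
```

Put another way, under this approximation a weighted poll of 1,169 carries about as much information as an unweighted poll of roughly 600.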

So, with weights, there can be considerably more uncertainty there than the usual, naive, textbook calculations would suggest. You’ve got to know something about the weights in order to compute the correct margins of error, and usually we don’t get enough information from the polling organization so as to be able to do that. In the United States the big polling organizations like CBS News will supply this kind of thing upon request, or when they deposit their data with archives (e.g., the Roper Archive at the University of Connecticut). I haven’t asked any of the big Australian organizations for the raw, individual level data (with weights) so we could look at this, but I/we should. Indeed, it is possible that phone polling in Australia is doing great in terms of response rates across the population, such that the weighting is actually a minor aspect of the process, and the nominal standard error and margin of error are very close to the correct quantities after we take the (minor) weighting into account; I just don’t know, but I doubt it very much.

In the meantime, I wonder if we could ask the polling organizations to actually report the “margin of error” based on the appropriate calculation given their post-stratification weights? It may well be in their interests to do so, since the resulting inflation in the margin of error buys a little insurance against bias. Put simply, if 95% of my results are really the truth plus/minus 4 points, then I don’t want to say “95% of my results are `the truth’ plus/minus 3 points”, because eventually I’ll be caught out.


Comments (2)

pooling the polls, in JAGS

Filed under: statistics — jackman @ 1:36 am

David Peterson writes:

Hey Simon

I have what I hope to be a quick favor to ask of you. I have a student
trying to figure out how to estimate a Bayesian state space model and is
having trouble with the code in R. I was wondering if I could get a copy of
the code you used for your paper at Methods last year. I haven’t done any
of these models and am having trouble helping him.



I actually don’t do this in R. I use R for the data setup, then hand off to JAGS for estimation/inference. I have some extensive notes etc including the JAGS code in the replication archive for my AusJPS article (“Pooling the Polls”) on the research part of my site (there is a zip file there, about item number 13 or 14). The JAGS/WinBUGS/OpenBUGS code for this problem is really simple, but the resulting MCMC algorithm constructed by those programs is computationally inefficient; the issue is that in the JAGS/BUGS paradigm, the latent states are treated as individual parameters and MCMC proceeds by sampling from their conditional distributions “one-by-one” rather than en bloc (sampling from the joint distribution of the latent states), and so this means you’ve got to run the hell out of the sampler (for the publication-quality runs reported in the AusJPS piece I used millions of iterations, which ran overnight, no big deal). Coding up a real DLM (dynamic linear model) via MCMC for this problem in C isn’t hard, I honestly just haven’t had the time to do it, but I think I will finally bite the bullet and include such a thing in my pscl package.

The other note here is identification. In the AusJPS article, I was doing election tracking, ex-post, which means I had the benefit of hindsight and I could exploit the constraint that the latent state is actually not latent (observed!) on election day; this provided an anchor off which the rest of the model (the house effects, the latent state, and the innovation variance parameter) was identified (if barely). In the work I’m doing with Neal Beck and Howard Rosenthal on Bush’s approval ratings, there isn’t a final/ultimate day of truth when the “real” approval number is revealed (a la election day in the case of election tracking), and so we use the identifying constraint that the house effects have mean zero (any “rank one” constraint like that will do).

Comments (3)

polls, journos, and social science

Tuesday July 10, 2007

Filed under: Australian Politics — jackman @ 11:39 am

Journalists write to deadline, to sell papers, and do it every day. Political scientists, well we’re under different sorts of pressures. The “science” part of the job title means we take sample sizes, bias, trends etc seriously, and while we want to be relevant to matters of public importance, we’re not paid to sell papers every day. Conflict can ensue, as I think we’re about to see (Mumble looks like he is about to be clobbered by The Australian).

Most of the time, this week’s poll will look just like last week’s poll. That’s the way it has to be when you are in the field with sample sizes of 1,400 or so. So what is a journalist to do? To use a lovely quote from Murray Goot on this score:

The problem in the reporting of the polls does not arise, fundamentally, from the fact that journalists are ill-trained to deal with such data — though that to some extent is true. Fundamentally, the press plays up differences which are otherwise insignificant because it has to. Its only alternative is to say that what a poll found today is not significantly different from what it found yesterday; and under most (though not all) circumstances, that sort of news is no news at all. (Goot 2000, 46).

I dropped those lines verbatim into the opening of my article on polling in the run-up to the 2004 election, which appeared in the AusJPS. I followed them up with some choice words from Peter Brent (Mumble):

[q]uantitative opinion polls aren’t that precise. But the process that pays for them pretends they are. They [polls] cost a bundle and so are given pride of place. Once they’re there, everyone involved goes along with the charade. (Brent 2004).

So when this week’s Newspoll reports the same 2PP number as the last Newspoll, don’t be surprised that the journalist tasked with writing it up focuses on the “newsworthy” aspect of the poll; in this case, the bump in the preferred PM numbers. The Australian’s headline on Tuesday and front page photo “led” with this aspect of the poll. A few sentences into the story I read that the 2PP number was unchanged. I felt suckered (and a mouse click later I saw that Mumble had ripped into the way the poll was reported), but Goot’s quote quickly came to mind: the journalist was doing his job (selling papers). Indeed, as the journalist in question put it:

My job is simply to tell people what the most interesting aspect of the latest Newspoll figures are and to put them into the perspective of reporting on those surveys for the last 15 years or so.

For a start, there’s no interest in saying the latest polls haven’t changed – the interest, politically and journalistically, is where has the change occurred. After all, if there were no changes at all in any category there’d be no point in reporting the latest poll.

As for the change referred to, I really don’t know what to read into the preferred PM numbers, maybe it is real, maybe it isn’t (43-42-15, Rudd-Howard-uncommitted, down from, say 46s to 49s for Rudd in Feb to June Newspolls, some more polls/data would be nice), maybe it will translate into an improvement in Coalition vote share, maybe it won’t. We’ll see soon enough (spoken like a true academic…?). I doubt that there is much to be gleaned from going back over previous cycles in seeing if a change in preferred PM leads or lags changes in 2PP. I tend to think the signal to noise (size of the changes relative to sampling error) ratio in such an analysis wouldn’t help us get a clear answer on this: we’re talking about small changes either leading or lagging other small changes, each measured with reasonable amounts of error. Even finding the “turning points” in the respective series might not be straightforward. Some statistical analysis from Possum is here. I wonder if this is the kind of thing that might be more precisely done with the methodology I sketched in the AusJPS article, pooling the polls (pooling the 2PP or 1st preferences results from multiple polls, as well as pooling the preferred PM results, since Newspoll isn’t the only public source of data on either).

I think two things are clear at this stage (and Mumble picked up on both of these): (1) Labor’s primary and 2PP numbers remain high by historical standards, and higher than at any equivalent point in the last couple of cycles; (2) no one believes that this election will be decided by a margin like the 56-44 type of numbers we’ve been seeing for the last 6 months or so.

I hope Mumble doesn’t get too badly roughed up in today’s Australian. We’ll see. For one thing, Peter is my co-author. He already copped a bit of a serve here, which perhaps only goes to show the growing power of non-traditional media like blogs and Crikey. Frankly, I’m surprised that the mainstream media are paying that much attention. It’s all good, in the medium-term, long-term. In addition to learning a lot of statistics and political science, I also learned in graduate school that “opinion in good men is but knowledge in the making…” (Milton, Areopagitica).

Follow-up: the editorial of The Australian this morning is a long defence of Newspoll and the Australian’s reporting of Newspoll, clearly in response to Mumble/Crikey under the subtitle “Online prejudice no substitute for real work” and Peter cops a direct serve in the end paragraphs.

Comments (15)