I’m speaking at a conference at Berkeley sponsored by the American Political Science Association on “Democracy Audits and Governmental Indicators”. In getting some remarks together (on the reliability of country-level measures of democracy, etc.) I wanted to compare the performance of measures of democracy against things like GRE scores and legislative ideal points.
ETS has a background document providing some technical data on GRE scores. The standard deviation of GRE-V scores issued in the 2003-08 period is 121 points, while the GRE-Q scores have a standard deviation of over 150 points. The standard errors of measurement are pretty small, relative to this cross-subject variation in the scores, and surprisingly uniform over the range of scores.
Usually you get a U-shaped relationship between standard errors of measurement (or, if you are a Bayesian, standard deviations of marginal posterior densities of latent scores) and the scores: we have greater uncertainty about test subjects in the tails of the ability distribution, since the test items tend to be less informative about those subjects (as they rack up a lop-sided pattern of right/wrong answers).
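To make that U-shape concrete, here is a minimal sketch using a two-parameter logistic (2PL) IRT model, where the standard error of measurement at ability theta is one over the square root of the test information. The 40-item fixed-form test and its item parameters are made up for illustration; they are not ETS's.

```python
import math

def item_information(theta, a, b):
    """Fisher information of one 2PL item (discrimination a, difficulty b)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def sem(theta, items):
    """Standard error of measurement: 1 / sqrt(total test information)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical fixed-form test: 40 items, discrimination 1.0,
# difficulties evenly spaced over [-2, 2].
ITEMS = [(1.0, -2.0 + 4.0 * k / 39) for k in range(40)]

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  SEM={sem(theta, ITEMS):.3f}")
```

The printed SEMs are smallest near theta = 0, where the item difficulties are concentrated, and grow toward either tail.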
The computer-administered, adaptive version of the GRE helps smooth out that U-shape, with the computer administering items that have “cut-points” close to the running estimate of the subject’s ability.
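A rough sketch of why adaptive administration flattens the curve, again under a 2PL model. This is an idealisation: it assumes the running estimate already equals the subject's true ability, and the 400-item pool is hypothetical, not a description of ETS's actual item selection.

```python
import math

def item_information(theta, a, b):
    """Fisher information of one 2PL item (discrimination a, difficulty b)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical item pool: 400 items, discrimination 1.0,
# difficulties ("cut-points") evenly spaced over [-4, 4].
POOL = [(1.0, -4.0 + 8.0 * k / 399) for k in range(400)]

def adaptive_sem(theta, pool, n_items=40):
    """SEM when the n_items with difficulty closest to theta are administered."""
    chosen = sorted(pool, key=lambda item: abs(item[1] - theta))[:n_items]
    info = sum(item_information(theta, a, b) for a, b in chosen)
    return 1.0 / math.sqrt(info)

for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta={theta:+.1f}  adaptive SEM={adaptive_sem(theta, POOL):.3f}")
```

Because every subject gets items pitched near their own ability, the SEM is nearly constant across the range covered by the pool.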
To look at this I plotted the “conditional standard errors of measurement” for the GREs (as reported by the ETS) against scores; see below.
There is something of an inverted U, which is weird. We’re actually getting less precision in the middle of the scales than in the tails. The other thing is that the standard errors of measurement are about 20-35% of the between-subject standard deviation of scores, tailing away to about 5-15% in the upper tails.
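To put those ratios in more familiar terms: under classical test theory, SEM = SD * sqrt(1 - reliability), so a ratio r of SEM to SD corresponds to a reliability of 1 - r^2. A back-of-envelope sketch, treating each conditional ratio as if it held across the whole scale (an assumption, since the reported standard errors are conditional on score; the function name is just illustrative):

```python
def implied_reliability(sem_to_sd_ratio):
    """Classical test theory: reliability = 1 - (SEM / SD)^2."""
    return 1.0 - sem_to_sd_ratio ** 2

# Ratios quoted in the post: 20-35% for most scores, 5-15% in the upper tails.
for ratio in (0.05, 0.15, 0.20, 0.35):
    print(f"SEM/SD={ratio:.2f}  implied reliability={implied_reliability(ratio):.4f}")
```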
I wish those standard errors of measurement were smaller; their size is really only a function of the length of the test, given that ETS has near-perfect knowledge of the item parameters. So, does the GRE need to be longer?
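How much longer? Since item information adds up across items, lengthening a test by a factor k with items of similar quality shrinks the SEM by roughly 1/sqrt(k). A minimal sketch of that arithmetic (the helper name is hypothetical, and "similar quality" is doing a lot of work here):

```python
def length_multiplier(sem_factor):
    """Test-length multiplier needed to scale the SEM by sem_factor (0 < sem_factor <= 1).

    Assumes the added items are as informative as the existing ones,
    so that SEM is proportional to 1 / sqrt(number of items).
    """
    if not 0.0 < sem_factor <= 1.0:
        raise ValueError("sem_factor must be in (0, 1]")
    return 1.0 / sem_factor ** 2

for factor in (0.9, 0.75, 0.5):
    print(f"SEM x {factor:.2f} needs a test {length_multiplier(factor):.2f}x as long")
```

Halving the standard errors means a test four times as long, which is the uncomfortable part of the question.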
In a traditional two-parameter IRT, the estimated ability for someone who got a perfect score would be the maximum, and anyone with an imperfect score would have a strictly lower ability. And the person who misses everything would have the absolute lowest possible ability estimate, with everyone who got at least one question right scoring above that (assuming positive discrimination for each item).
But GRE scores and other standardized tests that try to be comparable over time throw a wrench in the works. The reported GRE scale is truncated from the actual scale score (or more properly the transformation from the IRT scale to the 200-800 scale): you can get some questions right and earn a 200, and you can get some questions wrong and earn an 800. So the conditional SEM is also reduced at the tails because a true scaled “180” is reported as a 200 and a true scaled “830” will be reported as an 800.
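A quick simulation sketch of that truncation effect. The constant SEM of 40 points on the untruncated scale is an assumption picked purely for illustration, not an ETS figure, and measurement error is taken to be normal.

```python
import random
import statistics

def reported_sd(true_score, sem=40.0, lo=200.0, hi=800.0, n=20000, seed=1):
    """SD of the reported score when noisy scaled scores are clipped to [lo, hi].

    sem is a hypothetical, constant standard error on the untruncated scale.
    """
    rng = random.Random(seed)
    draws = [min(hi, max(lo, rng.gauss(true_score, sem))) for _ in range(n)]
    return statistics.pstdev(draws)

for score in (180, 200, 500, 800, 830):
    print(f"true scaled score={score}  SD of reported score={reported_sd(score):.1f}")
```

In the middle of the scale the reported scores vary about as much as the underlying error, while at and beyond the 200 and 800 limits much of the error is clipped away, so the spread of reported scores shrinks.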
They’re also doing a three-parameter IRT (with a guessing parameter) on the GRE, unlike the other tests which would be a two-parameter IRT, but I don’t think that would produce the inverted U on its own.
Comment by Chris Lawrence — Friday October 30, 2009 @ 7:32 pm
“So, does the GRE need to be longer?” poor applicants…
Comment by Antonio — Friday October 30, 2009 @ 7:44 pm
Not weird if you transform the scores. Say you apply a logistic function to a score of 7 with a standard error of 1: it then becomes almost 1 with almost no standard error. They probably then linearly transform the values to the range 200 to 800.
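That example can be worked through with the delta method: the standard error after a smooth transform is approximately the original standard error times the derivative of the transform. A minimal sketch, assuming a plain logistic followed by a linear map to 200-800 (a guess at the shape of the transformation, not ETS's actual one):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def to_reported(x, se, lo=200.0, hi=800.0):
    """Map a latent score and its standard error to a hypothetical lo-hi scale.

    Delta method: se of f(x) is about |f'(x)| * se, and the logistic's
    derivative is p * (1 - p).
    """
    p = logistic(x)
    score = lo + (hi - lo) * p
    score_se = (hi - lo) * p * (1.0 - p) * se
    return score, score_se

for x in (-7.0, -2.0, 0.0, 2.0, 7.0):
    score, score_se = to_reported(x, 1.0)
    print(f"latent={x:+.1f}  reported={score:.1f}  se={score_se:.2f}")
```

With the same standard error of 1 everywhere on the latent scale, the reported standard error is 150 points at the midpoint (a score of 500) and well under one point at a latent score of 7: an inverted U produced by the transform alone.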
Comment by Ken — Saturday October 31, 2009 @ 1:21 am
Thanks Chris, thanks Ken. Non-linear transforms from the output of the IRT model to the reported scales have got to be the culprit. Nice catch.
Comment by jackman — Saturday October 31, 2009 @ 2:27 pm
As someone currently suffering through the GRE experience, I find this plot enthralling – especially the slope of the verbal plot for roughly 600 < x < 770. For me, this highlights two things:
1) the extent to which the GRE verbal is a vocabulary drill,
2) how the GRE doesn't tell you much more than the time someone (had available to) put into studying for it.
A refresher on how the GRE works. The test-taker's score is largely determined by his performance on the first ten questions. By question #11, the computer is said to have "adapted" to the test-taker's level of ability. His score is unlikely to change much unless he gets the remaining questions consistently right or wrong.
Now, on the verbal section, analogies and antonyms (i.e. vocabulary) tend to predominate among the first ten questions. There are strategies available for reducing the set of possibly correct answers, but these are less effective for antonyms than analogies. In general, though, if you don't know the word, you can't peg the answer with certainty. So the score one gets on the verbal section is largely a function of the words one sees and the order in which one sees them.
Hence the slightly increasing but not-too-far-from-zero slope for the identified range. My hypothesis is that these people have learned enough words to see statistically harder words with each passing question, but they do not know enough words to know all those statistically harder words. Only when one has "memorized the dictionary," as it were, does the conditional s.e. drop precipitously (from 770 < x < 800).
The yield of the GRE assay is a lugubrious surrogate for the analytical adroitness of the subject.
Comment by Jack — Tuesday November 10, 2009 @ 2:21 pm
I tend to think Ken’s observation nailed it re what is happening to the standard errors at the tails of the reported scale.
The issue about test prep masking ability is different to the relationship between the conditional standard errors (CSEs) and reported score, IMHO.
Comment by jackman — Tuesday November 10, 2009 @ 5:02 pm