## Monday August 18, 2014

Filed under: computing,R,statistics,type — jackman @ 6:00 am

One of the very exciting and promising developments from RStudio is the rmarkdown/shiny/ggvis combination of tools.

We’re on the verge of static graphs and presentations being as old-fashioned as overhead transparencies.

I’ve spent the last couple of days giving these tools a test spin. Lots of comments and links to examples appear below.

I came to this investigation with a specific question in mind: how can I get a good-looking scatterplot with some rollover/tooltip functionality into a presentation, with one tool or one workflow?

Soft constraints: I’d prefer to use R, at least on the data side, and I would also like control over look and feel (e.g., slide transitions) and stylistic elements like type, color, sizes and spacing.

I use either Beamer or Keynote for presentations (Beamer for teaching/stats-type talks, Keynote for more substantive, general audience talks). I began by investigating how one might drop a d3-rendered graph into a Keynote presentation, but this seems pretty hard. Hacking at the files produced by Keynote’s export-to-HTML function seems formidable.

I’ve also been poking at solutions that are all on the JS side of the ledger (e.g., d3 + stack), inspired by this example from Karl Broman. I’m also interested in how one might roll an interactive graphic into Prezi.

But back to the RStudio workflow, using the rmarkdown/shiny/ggvis combination. Here is some sample output I’ve created: a standalone scatterplot and a dummy presentation.

Some observations:

• Building shiny server for OS X was pretty easy. Nic Ducheneaut has a set of instructions that worked just fine. One slight wrinkle is that I had to manually make a symbolic link to the pandoc executable – at /usr/local/bin/pandoc on my system – from the /usr/local/shiny-server/ext/pandoc directory.
• ggvis sits on top of vega which sits on top of d3. It is rapidly evolving but extremely promising. In about 15 minutes I had learned enough shiny and ggvis to make a scatterplot with respectable tooltip functionality. Fine-tuning graphical elements took considerably longer and a lot more code: ggvis is still young, and you are warned not to use it for production yet. I’d agree. Trying to get finer control over graphical elements did reveal some of the alpha-ness of ggvis at this stage. An example of what I was able to make appears here.
• ggvis rollover/tooltip behavior doesn’t seem to be as responsive/reliable/predictable as d3. It is almost as if ggvis/vega can’t resolve individual points in dense regions of a scatterplot. I don’t know why. Comparison: ggvis/shiny and d3.
• Deploying/embedding ggvis/shiny in markdown is straightforward. I found myself using the ioslides_presentation format for output.
• It seems that you’re supposed to be using Chrome’s (full-screen) presentation mode when you present, serving the pages from localhost or a (local/remote) shiny server. My explorations revealed subtle differences between Safari and Chrome on padding, margins, etc. That’s almost surely more to do with “web standards compliance” being a little wobbly between browsers than anything on the RStudio side. ioslides is a Google creation, so better to stay with Chrome.
• shiny itself seems a little buggy. RStudio’s browser sometimes just refuses to start, or starts slowly. “Open in Browser” sometimes simply refuses to work. Until it does. I suspect I should not have Safari as my default browser. I also suspect a conflict if you are simultaneously running an instance of shiny server locally.
• The reveal.js format looked promising, but seems to produce broken output.
• I noticed an odd quirk with widgets, which I reproduce here. For instance, this slider for bandwidth adjustment didn’t display properly out of the box. Just one number on the right of the slider appeared on start-up (the current value), then after sliding, the minimum appears on the left, but the current value still appears on the right. Toggling Chrome into presentation mode (which will be the typical usage) seemed to fix things, as did hitting “Inspect element”. I recall being able to repeat this with different output formats, so I don’t think this is necessarily an ioslides issue.
• The previous example also reveals some text encoding weirdness: the apostrophe in “don’t” is dropped on the title slide.
• Presentations can be served from a (remote) shiny server: simply call the Markdown file index.Rmd, place that and other files in an appropriately named subdir under your shiny server’s file hierarchy, and away you go. Indeed, RStudio has its own deployment and hosting service, shinyapps.io.
• Shiny uses bootstrap’s grid layout which I had to learn a little about to get some control over the size of the ggvis on the slide, but I was still very unsatisfied with my ability to control the size of a graph on the slide.
• I’m yet to play with tables or MathJax.
• I found customization pure agony. Suppose you don’t want Helvetica or Open Sans on a white background. Writing your own CSS seems the most sensible way to deal with this, but this involved tons of “Inspect Element” on the resulting HTML, tweaking the requisite CSS, re-compiling, repeating… Not fun. Shiny graphics have their own CSS from bootstrap. This is customizable too, at least in theory, but I was running out of time and energy at that point.
• “woff”. This was a nice gotcha, nothing to do with RStudio packages per se, but I thought I’d remind myself about this with a note here.  If you deploy your own web fonts via some custom CSS, keep in mind that shiny server runs from port 3838.  On my web server I put web fonts in a standard place where I can point back from various apps and services via CSS, typically with a fully-qualified URL. But then you’ll run into cross-domain access issues as your web server (on port 80 or 443 etc) is being asked to serve web fonts to what appears to be a remote server (shiny server from 3838). You’ll need to create a .htaccess file in the subdirectory containing the web fonts, granting access to the shiny-server, or else serve the web fonts out of the local directory where the application lives. This is “an oldie but a goodie” in the webfont world, apparently. Some details here.

If you’re happy with the out-of-the-box style defaults, then this stack of tools is just about there and evolving rapidly. And keep in mind that rmarkdown does a lot more than make presentations. For instance, I’m yet to really explore rmarkdown for producing publish-to-web papers.

If you crave fine control over layout and graphical elements, then I think it might still be a d3/js world, at least for a while longer.

I’m still left thinking that if I could drop shiny apps or d3 into Keynote (somehow), then I’d have the best of both worlds.

## Wednesday July 30, 2014

Filed under: R,statistics — jackman @ 1:04 pm

I’ve updated some of the graphical displays of the ideal point estimates I serve up here. I’ve rendered some of these in d3, with some rollover lah-de-dah: (1) 113th House ideal points in a long “caterpillar” format; (2) scatterplot of ideal point against Obama 2012 vote in district. Screenshot of the scatterplot appears below.

My R scripts dump csv containing the ideal point estimates, credible intervals, labeling info, which I then pick up on the d3 side. Separate files dump fitted values from local regression fitting estimated ideal point as a function of Obama vote in district.

I toyed with the idea of loess on the d3/js side (with sliders for user control of bandwidth etc), more as a plausibility probe than anything, but it seems like a lot to push down through the browser.
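To give a sense of what a bandwidth-controlled local regression involves, here is a minimal sketch in Python of a local linear smoother with tricube weights. It is a stripped-down illustration, not R's `loess` (which has more options, such as local quadratic fits and robustness iterations); the function names and the `span` default are assumptions made for this example.

```python
import math


def tricube(u):
    """Tricube kernel: positive for u < 1, zero otherwise."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear(x, y, x0, span=0.5):
    """Fitted value at x0 from a weighted straight-line fit to nearby points.

    span is the fraction of the data used in each local fit (the bandwidth).
    """
    n = len(x)
    k = min(n, max(3, math.ceil(span * n)))      # number of neighbours used
    dist = [abs(xi - x0) for xi in x]
    h = sorted(dist)[k - 1]                      # distance to k-th nearest point
    if h > 0.0:
        w = [tricube(d / h) for d in dist]
    else:
        w = [1.0 if d == 0.0 else 0.0 for d in dist]
    if sum(w) == 0.0:
        # Degenerate ties: fall back to equal weights on points within h.
        w = [1.0 if d <= h else 0.0 for d in dist]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    if sxx == 0.0:
        return ybar                              # no spread in x: weighted mean
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    return ybar + (sxy / sxx) * (x0 - xbar)


# Made-up data lying exactly on a line, so the smoother should recover it.
xs = [float(i) for i in range(11)]
ys = [2.0 * xi + 1.0 for xi in xs]
fit_at_2_5 = local_linear(xs, ys, 2.5, span=0.5)
```

Each fitted value requires a pass over all the data, and a slider that changes `span` means recomputing the whole curve, which gives some feel for why this might be a lot to push through the browser.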

## Thursday May 30, 2013

Filed under: Australian Politics,R,statistics — jackman @ 1:04 pm

I’ll be contributing a piece about once a week for the Guardian Australia, under a part of the web site we’re calling The Swing.

The set of graphs from my 1st effort were rendered in-line and rather low-res.

Bigger, full res versions appear below; click on the in-line versions.

It would be great to find a way to quickly make nice, web-friendly graphs out of R. Vega looks like a reasonable wrapper to d3. Datawrapper.de just doesn’t give me enough control over annotations, axes etc… I’m also looking at Rickshaw. Life is short, beautiful graphics are hard, sometimes…

## Friday August 24, 2012

Filed under: R,statistics — jackman @ 7:00 am

From one of the R lists I follow:

Today (2012-08-23) on CRAN [1]:

“Currently, the CRAN package repository features 4001 available packages.”

These packages are maintained by approximately 2350 different folks.

Previous milestones:

2011-05-12: 3,000 packages [1]
2009-10-04: 2,000 packages [2]
2007-04-12: 1,000 packages [3]
2004-10-01: 500 packages [4]
2003-04-01: 250 packages [4]

[1] http://cran.r-project.org/web/packages/
[2] https://stat.ethz.ch/pipermail/r-devel/2009-October/055049.html
[3] https://stat.ethz.ch/pipermail/r-devel/2007-April/045359.html
[4] My private in-house data.
[5] http://cran.r-project.org/web/checks/check_summary_by_maintainer.html

/Henrik

PS. This count includes only packages on CRAN. There are more
packages elsewhere.

Comments Off on CRAN might get tenure at Yale?

## Wednesday April 4, 2012

Filed under: Australian Politics,R,statistics — jackman @ 12:40 am

Labor won 15 of Queensland’s 29 House of Reps seats in the 2007 Federal election (AEC details here). Yet just three years later, in the 2010 Federal election, Labor won only 8 of 30 Queensland Reps seats, with 33.6% of 1st preferences (a swing of -9.3 percentage points).

Labor’s best performance on 1st preferences in 2010 was in Capricornia (46%), which translated into a 54-46 2PP result. Kevin Rudd won Griffith with 44% of 1st preferences, resulting in a 58-42 2PP result. Wayne Swan and the LNP candidate split the 1st preferences in Lilley, 41-41, with Swan winning the seat with Green preferences, 53-47 2PP. Labor managed to get home in Moreton in 2010, with 36% of the 1st preference vote, and a 51-49 2PP result.

The state election of some 10 days ago was conducted under different district boundaries (89 seats in the Queensland parliament) and a different electoral system (optional preferential). Moreover, Katter’s Australian Party ran candidates in 76 seats, winning 11.5% of 1st preferences, further complicating comparisons with previous elections (state or federal). In any event, Labor won about 26.7% of 1st preferences (ECQ results), down 6.9 percentage points from its performance in the 2010 Federal election, and down a staggering 15.6 percentage points from the 2009 state election.

How might these 2012 state-level results translate into Federal results?

There are many different ways of looking at this, all of which involve a little guesswork and assumptions given the differences in the two electoral systems, the configuration of parties and so on.

Here’s a stab that I’ve been working on over the last week or so (“Spring Break” here at Stanford). The AEC conveniently (!) geo-codes its polling places and publishes that data on its web site. Shape files for Federal electorates are also available. This makes it feasible to start re-aggregating booth-level results from the state election up to Federal seats.
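The core spatial step in this kind of re-aggregation is deciding which electorate polygon contains a geocoded polling place. A minimal sketch of that test in Python, using the standard ray-casting method, is below. It assumes each electorate is a single simple polygon given as (lon, lat) vertices; real shape files have multi-part polygons and holes, which a proper spatial library handles. The seat names and coordinates are made up.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count how many polygon edges a ray heading east crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does this edge straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def assign_seat(lon, lat, seats):
    """Return the name of the first seat whose polygon contains the point."""
    for name, polygon in seats.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None  # booth falls outside every polygon supplied


# Hypothetical electorates: two adjacent unit squares.
seats = {
    "Seat A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "Seat B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
booth_seat = assign_seat(1.5, 0.5, seats)
```

Points lying exactly on a shared boundary are assigned somewhat arbitrarily by this test, so booths that sit on or very near a boundary deserve a manual check.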

A few steps and assumptions are required (and I’ll write this up at some point):

• Parse the ECQ’s XML presentation of the 2012 state election results; I used the XML package in R. By the way, it is terrific that both AEC and ECQ put the XML’d version of their results up in real time; reasonably sane schema, relatively easy to parse, etc.
• geo-code the state polling places. ECQ doesn’t put lat/lons of its polling places up on its web site, at least not that I could find. I thought about hacking its Google maps overlay javascript, but that was beyond me, and the maps there seemed to only provide a rough guide as to the actual locations of ECQ polling places.
• My next move was to recall that there is tremendous overlap between state and Federal polling places, at least in metro areas. I wrote some code to look for matches between the strings describing state and Federal polling places. I also wrote some code that asked the Google maps API to return lat/lons of the addresses associated with each state polling place, which turned out to be quite imprecise once you get away from metro areas. But between Google and the AEC geo-codes, I was able to come up with usable geo-codes for 2,100 ECQ polling places (all of the ECQ’s actual polling places). I performed more than a few sanity checks and manual corrections on the geocodes (“visiting” many Qld schools and community halls in Google maps), and actually corrected some of the AEC geocodes too. It is then straightforward to map these geocoded state booths into Federal electoral divisions using functionality in the sp package in R.
• In the 2012 Queensland state election, only 75.7% of ballots were cast at actual polling places on Election Day (ECQ). The remaining ballots were cast using a variety of methods: pre-poll votes, postal votes and Election Day absentees being the three most used methods. Fun fact: 41.2% of Burleigh’s ballots were cast this way, the most of any QLD electorate. I allocated these (state-level) non-standard votes to Federal seats in proportion to the spread of the state seat’s regular, polling-place votes across Federal seats (fun facts: the state seats of Algester, Everton, Maryborough and Springwood each take in 4 Federal seats; 25 of the 89 QLD state seats lie wholly within one of Qld’s 30 Federal seats).
• There is perhaps a little more work to do refining the way I handle state booths that lie outside but very close to a particular Federal seat, say, where that booth is also used in Federal elections and for the Federal seat in question. That is, the AEC is telling us that we’ve got a polling place outside the electorate boundaries; surely some (all?) of the state votes cast at that booth should count towards the estimate we make for the “logical” Federal seat, not the “physical” Federal seat. Some of these booths serve multiple Federal seats, suggesting some kind of proportional allocation heuristic. I’m yet to do this last bit of fiddling; life is short and Spring Break is over…
• Turnout! No one ever talks about this. But get this. The ECQ has 2,468,290 ballots cast, corresponding to 89.9% turnout (2,746,844 total enrolled). In the 2009 state election turnout wound up being 91.0%. In the 2010 Federal election turnout was 92.8% (2,521,574 ballots; 2,719,360 enrolled), down 1.6pp from 2007 (by the way). But the point is that state-level turnout trails Federal by about 2 to 3 percentage points. You wonder about the partisan leanings of those voters not turning out in state elections, but coming out for the Federal election.
• I also wonder how much any effect here might be offset by the differences in informality state to Federal, OPV to full preferential. 5.5% of House votes cast in QLD in the 2010 Federal election were informal; the corresponding figure for the 2012 state election (OPV) is just 2.5%.
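The proportional allocation of non-standard votes described above can be sketched as follows. This is an illustration under stated assumptions rather than the code actually used: it assumes a single set of shares per state seat, computed from total polling-place votes and applied to every party alike, and the seat names and vote counts are invented.

```python
def allocate_nonstandard(booth_votes_by_fed_seat, nonstandard_by_party):
    """Split one state seat's non-standard votes across Federal seats.

    booth_votes_by_fed_seat: total polling-place votes from this state seat
        that fall in each Federal seat, e.g. {"Fed A": 6000, "Fed B": 2000}.
    nonstandard_by_party: the state seat's pre-poll, postal and absentee
        votes by party, e.g. {"ALP": 1000, "LNP": 3000}.
    """
    total = sum(booth_votes_by_fed_seat.values())
    if total <= 0:
        raise ValueError("no polling-place votes to base the shares on")
    allocation = {}
    for fed_seat, votes in booth_votes_by_fed_seat.items():
        share = votes / total  # this Federal seat's share of booth votes
        allocation[fed_seat] = {
            party: n * share for party, n in nonstandard_by_party.items()
        }
    return allocation


# Hypothetical state seat straddling two Federal seats.
result = allocate_nonstandard(
    {"Fed A": 6000, "Fed B": 2000},
    {"ALP": 1000, "LNP": 3000},
)
```

Here three quarters of the booth votes fall in "Fed A", so it receives three quarters of each party's non-standard votes. A per-party version (shares computed from each party's own booth votes) is an equally plausible reading.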

So what do you get when you do this re-aggregation, subject to all the caveats sounded above? Keep in mind I only have 1st preferences, at least for now.

The figure below (click for full-size) shows a scatterplot of imputed Federal results for the ALP given the 2012 state results, for each of Queensland’s 30 Federal seats, against the ALP’s actual 1st preference vote share (%) recorded in the 2010 Federal election. The diagonal line is a 45 degree line, a “no difference” line. On average, the data points lie below the diagonal, indicating what we know, that Labor did considerably better in the 2010 Federal election than in the 2012 state election.

Red dots and labels indicate the 8 seats won by Labor in 2010. The good news (!?) for Labor is that the Federal seats in which its primary vote utterly cratered are seats in which it had no chance of winning in the 1st place, where its 2010 1st preference vote share was below 30% or barely above 30% (e.g., Wide Bay, Maranoa, Fairfax, Wright, Fisher, Hinkler).

The bad news for Labor is that it would seem that most of its 8 Federal, Queensland seats are at some peril, with the exceptions perhaps being Griffith (Rudd’s seat), and maybe Rankin (Craig Emerson) and Oxley. The estimated ALP 1st preference vote share given the 2012 state results in these 3 seats lies above the actual ALP 1st preference recorded in Moreton in 2010, which was Labor’s weakest among the 8 seats it won in 2010 (and observe the many assumptions implied in that extrapolation).

Lilley, Swan’s seat, will be interesting. I grew up in Lilley on Brisbane’s northside. When Labor is really on the nose, it goes to the Coalition. Swan lost the seat in 1996 in his sophomore election, but has held it since 1998. I’m not sure the last redistribution helped, and it’s tough to see Labor winning it if its primary vote share slips below 35%. Complicating factors are the role the Katter party might play, as well as some kind of “personal vote” for Swan (an incumbent Federal Treasurer, no less).

I also show the implied swings given by these estimates of ALP 1st preference vote share (bigger version available by clicking):

This presentation of the data highlights that Griffith (Rudd’s seat) has the smallest implied swing among Labor’s 8 seats, around about 5 percentage points. Coupled with the fact that Rudd starts off at a tolerable level of 1st preference support, this bolsters confidence that Griffith remains Labor’s best shot at a “retain” in 2013.

The implied swing in Moreton is only a little larger, but there is far less buffer there. Swings of -7 to -8 percentage points on 1st preferences in Lilley, Rankin and Oxley would have to be almost surely fatal to Labor’s chances there. And double digit swings in Petrie, Blair and Capricornia would also have to be beyond the margin of survival.

Could Rudd be the last (QLD, Labor) one standing?

Comments Off on Rudd, the last one standing?: Federal implications of QLD state election results

## Saturday October 15, 2011

Filed under: computing,R,statistics — jackman @ 5:01 pm

Update to my pscl package, now on CRAN.

Biggest change: fixing a bug in the way MCMC draws for item parameters were being stored and summarized by ideal.

Comments Off on pscl 1.04 live on CRAN

## Wednesday October 12, 2011

Filed under: R,statistics — jackman @ 3:03 pm

Impressive.

You are not alone!

Comments Off on Bay Area R Users group has 1300 members

## Wednesday July 13, 2011

Filed under: Australian Politics,R — jackman @ 3:23 pm

The header of my blog (above) shows the latest prices on offer in some of Australia’s election betting markets. I convert the prices to an implied probability of ALP win (factoring out the bookie’s profit margin, the so-called “overround”).
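A minimal sketch of that conversion in Python, assuming decimal odds and a simple proportional normalization (one common way to strip out the overround; the post does not say which method is actually used). The prices are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Turn decimal odds on mutually exclusive outcomes into probabilities.

    The raw inverse odds sum to more than 1; the excess is the overround.
    Dividing each by their sum rescales them to sum to exactly 1.
    """
    raw = {outcome: 1.0 / price for outcome, price in decimal_odds.items()}
    total = sum(raw.values())
    probs = {outcome: r / total for outcome, r in raw.items()}
    return probs, total - 1.0


# Hypothetical prices: $4.00 for an ALP win, $1.20 for a Coalition win.
prices = {"ALP": 4.00, "Coalition": 1.20}
probs, overround = implied_probabilities(prices)
```

With these prices the raw inverse odds are 0.25 and about 0.833, an overround of roughly 8%, and the normalized ALP win probability is 3/13, or about 0.23.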

I’m using some Javascript by John Resig to make Tufte-ish sparklines, although the Google version of sparklines looks easy to work with too. I’m using some R to generate PNG files plotting the last 72 hours of data.

Time-series graphs appear as PDFs too, again see the header of the blog.

On the data themselves, the betting markets have been moving in a pro-Coalition direction over the last two weeks, with some movement around the time that recent polls have been released, showing that the Coalition would romp home. I think we’re still waiting on some post-carbon-tax polling, and how the betting markets digest that.

Comments Off on tracking Australian election betting markets again (now with sparklines)

## Wednesday June 29, 2011

Filed under: politics,R,statistics — jackman @ 11:09 am

Now that classes are over, I took a little time to update my scripts that update the analysis of Congressional roll calls in close to real time. Links appear at the top of the blog. As of about 15 minutes ago, we’re up to 77 non-unanimous roll calls in the 112th Senate. The House has 474 non-unanimous roll calls under its belt.

I’m presenting estimates of legislators’ “ideal points” and 95% credible intervals (from a model that fits just a single underlying dimension to the roll calls) both graphically (House/Senate) and in CSV. I also present scatterplots (and loess smoothing) of the estimated ideal points against a crude (but useful) measure of preferences in the legislators’ district/state, Obama vote share in the 2008 election (House/Senate). I’ve also got an SVG with rollovers for the dense House scatterplot, using the RSVGTipsDevice package, but the resulting SVG breaks in Chrome.

I’m scraping the roll calls and some meta data from the House and Senate sites, using the parsing in R’s XML package (which I’m finally understanding how to use effectively). Analysis of the roll calls is via the ideal function in (my) R package, pscl.

Quite aside from the methodology/technology, the substantive story is very much business as usual: zero partisan overlap in the recovered ideal point estimates. About 1 to 1.5 standard deviations of the ideal point distribution separate the ideal points of Democrats and Republicans among districts/states that split 50-50 Obama/McCain in 2008.

The other striking feature of the data is how few Democrats remain in the 112th House in districts where McCain beat Obama: I count 12 such seats.

## Monday June 13, 2011

Filed under: R — jackman @ 3:53 pm

Sweave source for the poll report for those who expressed some interest.

You’ll also need this file of R function definitions, utilities.R.

I also wrote a little shell script that calls Sweave and xelatex etc, hacking the Sweave.sh script that ships with R.

