Against all expectation, I was swept up by the enthusiasm of the friendly R community at the fabulous annual UseR meeting, held here at Stanford late last week and co-sponsored by Stanford Medicine's Biomedical Data Science Initiative.
R is a programming language that is a direct descendant of the S language created at Bell Labs in Murray Hill, New Jersey in the 1970s and 1980s. R’s strengths are statistical analysis, data visualization and nearly infinite extensions.
But the essence of R today is openness. Virtually all of R and its extensions are open source. One day over lunch, a man who uses R to analyze the creditworthiness of applicants for subprime mortgages told me somewhat sheepishly, “We aren’t very R-ish,” by which he meant that his company doesn’t share its code.
While R itself requires programming skills, various graphical interfaces make it accessible to the rest of us. I met a professor of programming who is writing a grant that would allow her to teach R to middle schoolers in Houston. The basics of the language are easily handled by 12- and 13-year-olds and could lead to jobs down the road. Many of the talks I attended ended with, “And by the way, we are hiring! If you are interested, come up afterwards and we’ll talk.”
So what is R good for? Pretty much anything data-related you can think of. Data modeling, analysis and visualization are the focus. But you can even write and publish books using an R extension called R Markdown.
One speaker builds world population projections for the UN, predicting, for example, that the population of the United States will exceed 400 or 500 million by 2100. There’s no sign of a decrease in global population size this century, and the chance of reaching peak population before 2100 is only about 30 percent. The data journalism website FiveThirtyEight uses R for entertainment and education: They've explained to their readers how p-hacking works by showing spurious “statistically significant” associations between eating cabbage and having an innie bellybutton, and they've also compared movies' box office success with their Bechdel ratings.
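To see why p-hacking produces such "findings," here is a minimal sketch in plain Python. It is not FiveThirtyEight's analysis. It invents random yes/no data for a trait and for many unrelated "foods," then counts how many foods pass the usual 0.05 significance cutoff by chance alone. The function names, the sample sizes and the choice of a two-proportion z-test are all assumptions made for illustration.

```python
import math
import random


def two_sided_p(hits_a, n_a, hits_b, n_b):
    """Two-proportion z-test p-value, using the normal approximation."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (hits_a / n_a - hits_b / n_b) / se
    # P(|Z| > |z|) for a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))


def count_spurious_hits(n_people=500, n_foods=100, alpha=0.05, seed=1):
    """Count 'significant' food/trait links in data that is pure noise."""
    rng = random.Random(seed)
    # Hypothetical trait (say, bellybutton type), assigned by coin flip.
    trait = [rng.random() < 0.5 for _ in range(n_people)]
    hits = 0
    for _ in range(n_foods):
        # Each hypothetical food is eaten or not, also by coin flip.
        eats = [rng.random() < 0.5 for _ in range(n_people)]
        eaters = sum(eats)
        non_eaters = n_people - eaters
        if eaters == 0 or non_eaters == 0:
            continue
        eaters_with_trait = sum(1 for e, t in zip(eats, trait) if e and t)
        others_with_trait = sum(1 for e, t in zip(eats, trait) if not e and t)
        p = two_sided_p(eaters_with_trait, eaters, others_with_trait, non_eaters)
        if p < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    print(count_spurious_hits(), "of 100 unrelated foods look 'significant'")
```

With a cutoff of 0.05, roughly one test in 20 will look significant even when nothing real is going on, so checking 100 foods and reporting only the winners all but guarantees a headline.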
Predicting the immediate course of major disease outbreaks is another challenge for R. An extension for R called outbreaker helps predict the course of an epidemic in real time based on estimates of how many other people each sick person infects and similar information. But outbreaker is still too slow, so R users recently hosted a hackathon for emergency outbreak response to contribute working solutions that make its successor, outbreaker2, faster and more accurate.
I attended three different talks about heatmaps, those colorful charts that show the gene activity of scores or even hundreds of different genes. An upcoming R extension called superheat will offer a convenient way to make great heatmaps with lots of options for presenting the data.
And one presenter talked about DataSHIELD, a series of R packages that together allow researchers to analyze patient data from different sources without disclosing private information. To give a simple example, if a researcher wanted to know the average number of patients with a particular condition, DataSHIELD could query scores of datasets and deliver the results of the calculations without providing access to the actual data. (Big data specialists are increasingly looking for ways to let researchers explore electronic health records from hundreds of different health-care systems without giving them direct access to individual records.)
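The averaging example can be sketched in a few lines. This is a minimal illustration in plain Python, not DataSHIELD's actual R interface: the Site class, the pooled_mean function and the minimum-record rule are hypothetical names and assumptions made up for this sketch. The idea is simply that each data holder hands back a total and a count, never the individual records, and the researcher combines those summaries.

```python
from typing import List, Tuple


class Site:
    """A hypothetical data holder. Raw records stay private to the site."""

    def __init__(self, name: str, records: List[float]) -> None:
        self.name = name
        self._records = list(records)  # never returned to the caller

    def summary(self, min_records: int = 3) -> Tuple[float, int]:
        """Release only (sum, count), and only if enough records exist.

        The minimum-record rule is an assumed safeguard for this sketch:
        a summary over very few records could reveal an individual value.
        """
        if len(self._records) < min_records:
            raise ValueError(f"{self.name}: too few records to release a summary")
        return sum(self._records), len(self._records)


def pooled_mean(sites: List[Site]) -> float:
    """Combine per-site summaries into one overall average."""
    total, count = 0.0, 0
    for site in sites:
        site_sum, site_n = site.summary()
        total += site_sum
        count += site_n
    return total / count


if __name__ == "__main__":
    hospitals = [Site("A", [10, 20, 30]), Site("B", [40, 50, 60, 70])]
    print(pooled_mean(hospitals))  # prints 40.0
```

The researcher gets the same answer as if all seven records had been pooled in one place, but only two pairs of numbers ever left the hospitals.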
Four days of R talks can't be compressed into a few hundred words. Suffice it to say, if you love data, you’ll most likely love the annual UseR Conference. I am neither a programmer nor a statistician, but it was great fun. For more detail, check out UseR's book of abstracts.
The meetings alternate between Europe and the U.S., which means next year they will be back in Europe — in Brussels. I feel so lucky to have stumbled on one of their annual meetings in my own backyard.
Heatmap image from Wikipedia