Six Questions for Jonathan Carroll, author of Beyond Spreadsheets with R

By Frances Lefkowitz

Jonathan Carroll is a data science consultant providing R programming services. He holds a PhD in theoretical physics.

 

Say I know how to make an Excel spreadsheet, but I’ve never done a lick of programming; will I be able to get anything out of your book?
Absolutely! In fact, I bet you’re not giving yourself enough credit about the spreadsheet. If you’ve ever added two cells in a spreadsheet then you’ve already been programming. You might have even used the keyboard for this, e.g. =A1+B1. Beyond Spreadsheets with R starts from that level and guides you toward a programming approach to working with data. Once you understand the basic structures, you’ll be able to calculate the average of values, plot them, and discover patterns within them in a reproducible way.
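As a rough illustration of that step from spreadsheet to script, here is how the formula =A1+B1 and an average might look in R. This is only a sketch: the values and the names a1, b1, and values are made up for the example and are not taken from the book.

```r
# Hypothetical cell contents, standing in for spreadsheet cells A1 and B1
a1 <- 2
b1 <- 3

# The spreadsheet formula =A1+B1 becomes ordinary arithmetic
total <- a1 + b1

# A made-up column of values and their average
values <- c(4, 8, 15, 16, 23, 42)
average <- mean(values)

print(total)
print(average)
```

Because the calculation is written down as text rather than typed into cells, rerunning it on new numbers only means changing the inputs.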
And what kinds of things will I be able to do when I finish the book?
If you’ve been working in a spreadsheet then you’ll be able to formalize all of that. You can construct a processing script which performs operations on data, and apply that script to different versions of your data without having to redo everything every time. This means less “click this button then select these values,” and more reliable, reproducible processing. Once you know R at the “end of this book” level, the possibilities become endless. I’ve used spreadsheets as a starting point for data, but that’s hardly the only source of input R can work with. With the same tools, you can collect and analyze tweets, manipulate images, or connect to other systems to do processing. You can even work with R to do things that aren’t “data,” such as build a blog, generate art, or even connect to another programming language.
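The idea of one script applied to different versions of the data can be sketched in a few lines of R. Everything here is an assumption made for illustration: the function name summarise_amounts, the column name amount, and the two small data sets are hypothetical, not examples from the book.

```r
# A hypothetical processing step: ignore missing amounts, then summarise
summarise_amounts <- function(data) {
  amounts <- data$amount[!is.na(data$amount)]
  data.frame(
    n = length(amounts),
    total = sum(amounts),
    average = mean(amounts)
  )
}

# Two made-up "versions" of the same kind of data
january <- data.frame(amount = c(10, NA, 30))
february <- data.frame(amount = c(5, 15, 25, NA, 55))

# The same script runs unchanged on both
print(summarise_amounts(january))
print(summarise_amounts(february))
```

The processing lives in one place, so a fix or a change to the function applies to every data set it is run on.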
What do you do with R?
I’ve often been employed to work with R to transform data reliably from one source to another. Real-world data is usually messy (missing values, misspellings, odd formats), and while a spreadsheet is a nice way to interact with that data, it’s not productive if you need to do those operations many (maybe thousands) of times. I’ve built models to estimate statistical quantities based on other people’s data, and built tools to help myself and other people inspect the raw and generated data, whether as tables, graphs, or reports. I’ve worked with data on fisheries, the electricity market, sports betting odds, and teaching surveys. At the moment I’m working with genomics data to help scientists look for a way to treat cancer. I wrote my book in R Markdown, which means you can trust that all of the code output actually derives from the input. I’m also writing my new blog in R Markdown and publishing it right from RStudio.
In addition to covering R and RStudio, does your book also serve as a primer on the basics of data science?
The basics for me are “how do I get this data into my analysis?” and “how do I clean up this data?” These are absolutely covered. As for what sort of analysis you might want to do to that data, I have a small preview in the book, but it’s so dependent on the end goal that it’s worth following up later with a more specific resource. The old saying goes that “data cleaning is 90% of data science; the other 90% is doing something with the data,” which hints at the fact that cleaning the data often takes a lot more effort than first assumed.
How do I decide if I need to ditch the old spreadsheet and use R on a project instead?
If you’re trying to decide whether a programming solution is right for your task, consider how many times you’re going to need to use your workflow. If it’s just once, then you can maybe get away with your spreadsheet. If you need to do it twice (and that may just mean revisiting the one time you did it to confirm it’s correct), then a programming solution is going to be of benefit. Plus you’ll be able to share what you’ve done with a colleague, a client, a supervisor, or yourself in six months.
Enough about data; let’s talk about beer. Tell me about your home brew. Also, what are your thoughts about fruit in beer?
I’ve been brewing for about fifteen years now, from grain for about half that time. It’s a wonderfully complex process if you really get into the science of it all (water chemistry, thermodynamics, yeast biology …) and the end result is deliciously rewarding. South Australia gets quite hot so I lean towards ales unless I really want to overwork my lagering fridge. I’m not a big fan of fruity beers, though fruity hops, certainly. But I once made a Vegemite stout, so who am I to talk?