From Think Like a Data Scientist by Brian Godsey
In this article, you will learn about how to choose statistical software tools – what’s important to look for and the things you should consider while choosing the software that’s right for the job.
Choosing statistical software tools
A broad range of tools are available to implement statistical methods. You can compare methods or models with the software options available to implement them, and arrive at a good option or two. In choosing software tools, there are various things to consider, and some general rules that I follow. I’ll outline those here.
Does the tool have an implementation of the methods?
You can code the methods yourself, but if you’re using a common method, there are many tools that already have an implementation, and it’s probably better to use one of those. Code that’s been used by many people is usually relatively error-free compared to code that you wrote in a day and used only once or twice.
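As a rough illustration of that point, the sketch below compares a quickly written variance formula with the implementation in Python's standard `statistics` module. The data values and the function name `naive_sample_variance` are made up for this example. The true sample variance of these four numbers is 30; the one-pass formula can lose precision when a small spread sits on top of a large offset, while the library routine is written to avoid that problem.

```python
import statistics


def naive_sample_variance(values):
    """Quick one-pass formula: (sum(x^2) - (sum(x))^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / (n - 1)


# Hypothetical data: a small spread on top of a large offset.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

quick_result = naive_sample_variance(data)    # may be far from 30
library_result = statistics.variance(data)    # widely used implementation

print("hand-written:       ", quick_result)
print("statistics.variance:", library_result)
```

On small, well-scaled inputs the two agree, which is exactly why a bug like this can go unnoticed in code that was written in a day and used only once or twice.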
Depending on your ability to program and your familiarity with various statistical tools, there may be a readily available implementation in one of your favorite tools that you could use quickly. If Excel has it, then most likely every other tool does, too. If Excel doesn’t, then maybe the mid-level tools do, and if they don’t, you’re probably going to have to write a program. Otherwise, you’ll choose a different statistical method.
If you decide to go with a programming language, remember that not all packages or libraries are created equal. Make sure that the programming language and the package can do exactly what you intend. It might be helpful to read the documentation or some examples that are like the analysis you want to do.
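One lightweight way to start that check in Python is to look at a function's signature and docstring before relying on it. The helper `describe` below is a hypothetical name used only for this sketch, and `statistics.variance` is merely a stand-in for whichever function you are evaluating.

```python
import inspect
import statistics


def describe(func):
    """Return the call signature and first docstring line of a function."""
    signature = str(inspect.signature(func))
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else "(no docstring)"
    return f"{func.__name__}{signature}: {first_line}"


# Stand-in target; swap in the function you actually plan to use.
summary_line = describe(statistics.variance)
print(summary_line)
```

This is no substitute for reading the full documentation, but it quickly shows which arguments a function accepts and whether it is documented at all.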
Flexibility is good
In addition to being able to perform the main statistical analysis, it’s often helpful if a statistical tool can perform some related methods. Often, you’ll find that the method you chose doesn’t quite work as well as you’d hoped, and what you learned in the process leads you to believe that a different method might work better. If your software tool doesn’t have any alternatives, then you’re either stuck with the first choice or you’ll have to switch to another tool.
For example, if you have a statistical model and you want to find the optimal parameter values, you’ll be using a likelihood function and an optimization technique. A few types of methods might work for finding optimal parameters from a likelihood function, including maximum likelihood (ML), maximum a posteriori (MAP), expectation-maximization (EM), and variational Bayes (VB). Excel has a few different specific optimization algorithms (they’re all ML methods), and if you think you can get away with ML but you’re not sure, you might want to level up to a more sophisticated statistical tool that has more options for optimization.
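To make the difference between two of these methods concrete, here is a minimal sketch that estimates the rate of an exponential model by ML and by MAP using the same optimizer. Everything in it is an assumption chosen for illustration: the data, the exponential model, the Gamma prior and its parameters, and the simple golden-section search standing in for whatever optimizer a real tool provides.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


# Hypothetical observations, modeled as exponential with unknown rate.
data = [0.5, 1.2, 0.3, 2.1, 0.9]
n, total = len(data), sum(data)


def neg_log_likelihood(rate):
    # Exponential model: log L(rate) = n * log(rate) - rate * sum(x)
    return -(n * math.log(rate) - rate * total)


# Assumed Gamma(shape, rate) prior on the exponential rate.
PRIOR_SHAPE, PRIOR_RATE = 3.0, 1.0


def neg_log_posterior(rate):
    # Log prior up to an additive constant that does not affect the optimum.
    log_prior = (PRIOR_SHAPE - 1) * math.log(rate) - PRIOR_RATE * rate
    return neg_log_likelihood(rate) - log_prior


ml_rate = golden_section_minimize(neg_log_likelihood, 1e-6, 10.0)
map_rate = golden_section_minimize(neg_log_posterior, 1e-6, 10.0)

print("ML estimate: ", ml_rate)
print("MAP estimate:", map_rate)
```

Switching from ML to MAP here only means swapping the objective function. A tool that exposes its optimizer and lets you supply your own objective makes that kind of switch easy; one that hard-codes a single method does not.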
There are multiple types of regression, clustering, component analysis, and machine learning, and some tools may offer one or more of those methods. I tend to favor the statistical tools that offer a few methods from each of these categories in case I find the need to switch, or try another.
Informative is good
Awareness in the face of uncertainty is a primary aspect of data science; this carries into selection of statistical software tools. Some tools might give good results, but don’t provide insight into how and why those results were reached. On one hand, it’s good to be able to de-construct the methods and the model to understand the model and the system better. On the other hand, if your methods make a “mistake” in some way, and you find yourself looking at a weird, unexpected result, then more information about the method and its application to your data can help you diagnose the specific problem.
Some statistical tools, particularly higher-level ones like statistical programming languages, offer the capability to see inside nearly every statistical method and result – even “black box” methods like machine learning. These insides aren’t always user-friendly, but at least they’re available. It’s my experience that spreadsheets like Excel don’t offer a lot of insight into their methods, and it’s difficult to de-construct or diagnose problems for statistical models which are more complicated than, say, linear regression.
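As a small sketch of what "seeing inside" a method can look like, the function below fits a one-variable linear regression and returns the intermediate quantities along with the coefficients. It is written out by hand only so that every piece is visible; in practice an established library would supply the same information. The function name `fit_line` is hypothetical, and the data reuse the X1 and y values from the regression listings in this article.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, with diagnostics."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
        "r_squared": 1 - ss_res / ss_tot,
    }


x1 = [1.01, 1.99, 2.99, 4.01]
y = [3.0, 5.0, 7.0, 9.0]

fit = fit_line(x1, y)
for name, value in fit.items():
    print(name, value)
```

If a result looks weird, the residuals are often the first place to look: a single large residual points to a problematic data point, while a pattern across all of them suggests the model itself is a poor fit.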
Common is good
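# Two predictors (X1, X2) and one response (y)
data = data.frame(X1 = c( 1.01, 1.99, 2.99, 4.01 ),
                  X2 = c( 0.0, -2.0, 2.0, -1.0 ),
                  y = c( 3.0, 5.0, 7.0, 9.0 ))
# Fit a linear regression of y on X1 and X2
linearModel <- lm(y ~ X1 + X2, data)
# Show coefficients, standard errors, and fit statistics
summary(linearModel)
# Fitted values for the original rows
predict(linearModel, data)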
With many things in life (music, television, film, and news articles), popularity doesn’t always indicate quality; often it indicates the contrary. With software, more people using a tool means that more people have tried it, gotten results, examined the results, and probably reported problems, if any. In that way, software, particularly open-source software, has a feedback loop that fixes mistakes and problems in a timely fashion. The more people participating in this feedback loop, the more likely a piece of software is relatively bug-free and otherwise robust.
This isn’t to say that the most popular things right now are the best. Software has trends and fads, like everything else. I tend to look at popularity over the past few years of use by people who are in a similar situation to myself. In a general popularity contest of statistical tools, Excel would obviously win. But if we consider only data scientists, and maybe only data scientists in a specific field (excluding accountants, finance professionals, and other semi-statistical users), we’d probably see its popularity fade in favor of the more serious statistical tools.
For me, the criteria that a tool must meet are:
- The tool must be at least a few years old;
- The tool must be maintained by a reputable organization;
- Forums, blogs, and literature must show that many people have been using the tool for quite some time, and without too many significant problems recently.
Well-documented is good
In addition to being in common use, a statistical software tool should include comprehensive and helpful documentation. It's frustrating when I'm trying to use a piece of software and I have a question that should have a straightforward answer, but I can't find that answer anywhere.
It's a bad sign if you can't find answers to some big questions, such as how to configure inputs for doing linear regression, or how to format the features for machine learning. If the answers to big questions aren't in the documentation, then it's going to be even harder to find answers to the more particular questions that you’ll inevitably run into.
> summary(linearModel)

Call:
lm(formula = y ~ X1 + X2, data = data)

Residuals:
       1        2        3        4
-0.02115  0.02384  0.01500 -0.01769

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.001542   0.048723  20.556  0.03095 *
X1          1.999614   0.017675 113.134  0.00563 **
X2          0.002307   0.013361   0.173  0.89114
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03942 on 1 degrees of freedom
Multiple R-squared:  0.9999,    Adjusted R-squared:  0.9998
Documentation is usually a function of the age and popularity of the software. The official documentation for the tool should be on the maintaining organization's web page, and it should contain informative instructions and specifications in plain language that is easy to understand. It's funny how many software organizations don't use plain language in their documentation, or make their examples overly complicated. Perhaps it's my aversion to unnecessary jargon, but I shy away from using software whose documentation I don't readily understand.
As with determining if a tool is common enough, I also check forums and blog posts to determine whether there are sufficient examples and questions with answers that support the official documentation. No matter how good the documentation is, there are likely gaps and ambiguities, and it’s helpful to have informal documentation as back-up.
Purpose-built is good
Some software tools or packages are built for a specific purpose, and other functionality is added on later. For example, the matrix algebra routines in MATLAB and R were of primary concern when the languages were built, and it's safe to assume that they're comprehensive and robust. In contrast, matrix algebra wasn't of primary concern in the initial versions of Python and Java, and these capabilities were added later in the form of packages and libraries. This isn't necessarily bad; Python and Java have robust matrix functionality now, but the same can't be said for every language that claims to be able to handle matrices efficiently.
In cases where the statistical methods I want are a package, library, or add-on to a software tool, I place the same scrutiny on that package as I would on the tool itself: is it flexible, informative, commonly used, well-documented, and otherwise robust?
Interoperability is good
Interoperability is a sort of converse of being purpose-built, but they're not mutually exclusive. Some software tools play well with others, and in these you can expect to be able to integrate functionalities, import data, and export results, all in generally-accepted formats. This, of course, is helpful in projects where other software is used for related tasks.
If you’re working with a database, it can be helpful to use a tool that interacts with the database directly. If you're going to build a web application based on your results, choose a tool that supports web frameworks, or at least one that can export data in JSON or some other web-friendly format. Or, if your statistical tool will be used on various types of computers, ensure the software is able to run on the various operating systems. It's not uncommon to integrate a statistical software method into a completely different language or tool. If this is the case, it’s good to check if, for example, you can call Python functions from Java (you can, with some effort).
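As one example of the export step, the sketch below packages some model results as JSON using Python's standard `json` module, so that a web application or another tool could read them. The function name `results_to_json`, the field names, and the numbers are all placeholders invented for this illustration, not output from any listing in this article.

```python
import json


def results_to_json(coefficients, predictions):
    """Serialize model results to a JSON string that other tools can parse."""
    payload = {
        "model": "linear_regression",
        "coefficients": coefficients,
        "predictions": [round(p, 4) for p in predictions],
    }
    return json.dumps(payload, indent=2, sort_keys=True)


# Placeholder values standing in for real model output.
exported = results_to_json(
    {"intercept": 1.0, "X1": 2.0, "X2": 0.0},
    [3.02, 4.98, 6.98, 9.02],
)
print(exported)

# Any JSON-aware tool should be able to read the result back.
restored = json.loads(exported)
```

Plain structures like dictionaries, lists, numbers, and strings survive this round trip unchanged, which is much of what makes a generally accepted format convenient for moving results between tools.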
R was purpose-built for statistics, and interoperability was somewhat of an afterthought, though there’s a vast ecosystem of packages supporting integration with other software. Python was built as a general programming language, and statistics was an afterthought, but as I've said, the statistical packages for Python are some of the best available. Choosing between them and others is a matter of vetting all languages, applications, and packages you intend to use.
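import statsmodels.regression.linear_model as lm

# Each row is [X1, X2, 1]; the trailing 1 is the constant (intercept) column
X = [ [ 1.01, 0.0, 1 ],
      [ 1.99, -2.0, 1 ],
      [ 2.99, 2.0, 1 ],
      [ 4.01, -1.0, 1 ] ]
y = [ 3.0, 5.0, 7.0, 9.0 ]

# Set up and fit an ordinary least squares model
linearModel = lm.OLS(y,X)
results = linearModel.fit()
# Show coefficients and fit statistics, then fitted values for the original rows
results.summary()
results.predict(X)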
Permissive licenses are good
Most software has a license, either explicit or implied, that states which restrictions or permissions exist on the use of the software. Proprietary software licenses are usually obvious, but open-source licenses often aren't as clear.
If you're using commercial software for commercial purposes, that’s fine, but it can be legally risky to do the same with an “academic” or “student” license. It can also be dangerous to sell commercial software, modified or not, to someone else without confirming that the license doesn't prohibit this.
When I do data science using an open-source tool, my main question is whether I can create software using this tool and sell it to someone without divulging the source code. Some open-source licenses allow this, and some don't. It's my understanding (though I'm not a lawyer) that I can't sell an application that I've written in R without also providing the source code; in Python and Java, this is generally permitted, and this is one reason why production applications aren’t generally built in R and other languages with similar licenses. Of course, there are usually legal paths around this, such as hosting the R code yourself and providing its functionality as a web service or similar. In any case, it's best to check the license and consult a legal expert if you suspect you might violate a software license.
Knowledge and familiarity are good
I put this general rule last, though I suspect that most people, myself included, consider it first. I'll admit: I tend to use what I know. There might be nothing wrong with using the tool you know best, if it does reasonably well with the previous rules. Python and R, for example, are good at almost everything in data science, and if you know one better than the other, use that one.
On the other hand, there are many tools out there that aren't the right tool for the job. Trying to use Excel for machine learning, for example, isn't usually the best idea, though I hear this is changing as Microsoft expands its offerings. In cases like this, where you might be able to get by with a tool that you know, it's worth considering learning one that is more appropriate for your project.
In the end, it's a matter of balancing the time you'll save by using a tool you know against the time and quality of results you'll lose by using an inappropriate tool. The time constraints and requirements of your project are often the deciding factor.