From Think Like a Data Scientist by Brian Godsey
In this article, we’re going to discuss the importance of identifying and reviewing any assumptions you might have about the data you’re working with.
Checking assumptions about the data
Whether we like to admit it or not, we all make assumptions about data sets. We might assume that our data fall within a particular time period. Or we might assume that the names of the folders that contain emails are appropriate descriptors of the topics or classifications of those emails. These assumptions about the data can be expectations or hopes, conscious or subconscious.
Assumptions about the contents of the data
Let’s consider the element of time in the example of Enron data. When I began looking at the data, I certainly assumed that the emails would span the few years between the advent of email in the late 1990s and the demise of the firm in the early 2000s. I would have been mistaken, probably because of errors or corruption in the date formatting. In practice, I saw dates far outside the range that I assumed, as well as some other dates that were questionable. My assumption about the date range was certainly one that needed to be checked.
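A date-range assumption like this one is easy to check in a few lines. The following is a minimal sketch, not the procedure I used on the Enron data: the window boundaries, the function name `split_by_date_range`, and the sample dates are all made up for illustration.

```python
from datetime import datetime

# Hypothetical window: the text only says "late 1990s" to "early 2000s",
# so these exact boundaries are an assumption.
EXPECTED_START = datetime(1997, 1, 1)
EXPECTED_END = datetime(2003, 12, 31)


def split_by_date_range(dates, start=EXPECTED_START, end=EXPECTED_END):
    """Split dates into those inside and outside the expected window."""
    in_range, out_of_range = [], []
    for d in dates:
        if start <= d <= end:
            in_range.append(d)
        else:
            out_of_range.append(d)
    return in_range, out_of_range


# Made-up sample values, not taken from the Enron data.
sample = [
    datetime(2001, 5, 14),
    datetime(1979, 12, 31),
    datetime(2000, 10, 2),
    datetime(2044, 1, 4),
]
ok, suspicious = split_by_date_range(sample)
print(f"{len(suspicious)} of {len(sample)} dates fall outside the expected range")
for d in sorted(suspicious):
    print("  check:", d.date())
```

The point is not the window itself but the habit: state the range you expect, then count how many records violate it before deciding whether the violations matter.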
If we want to use the folder names in the email accounts to inform us about the contents of emails within, there’s an implied assumption that these folder names are informative. We’d want to check this, which would likely involve a fair amount of manual work, such as reading a bunch of emails and using our best judgment about whether the folder name describes what’s in each email.
One specific thing to watch out for is missing data or placeholder values. We tend to assume, or at least hope, that all fields in the data contain a usable value. But emails often have no subject or lack a name in the “from” field, and in CSV data there might be “NA”, “NaN”, or a blank space where a number should be. It’s always a good idea to check whether such placeholder values occur often enough to cause problems.
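One way to see how often placeholders occur is to count them per column. This is a rough sketch using only the Python standard library; the set of placeholder strings, the function name `count_placeholders`, and the CSV text are assumptions for illustration, and real data may use other markers.

```python
import csv
import io
from collections import Counter

# Assumed placeholder strings (compared in lowercase); extend for your data.
PLACEHOLDERS = {"", "na", "nan", "n/a", "null", "none"}


def count_placeholders(rows):
    """Count placeholder values per column in an iterable of dict rows."""
    counts = Counter()
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            # DictReader yields None for fields missing from a short row.
            if value is None or value.strip().lower() in PLACEHOLDERS:
                counts[column] += 1
    return counts, total


# Made-up CSV text for illustration.
csv_text = """sender,subject,size_kb
alice@example.com,Meeting,12
,Re: Meeting,NA
bob@example.com,,NaN
carol@example.com,Lunch,
"""
counts, total = count_placeholders(csv.DictReader(io.StringIO(csv_text)))
for column, n in counts.items():
    print(f"{column}: {n}/{total} rows missing ({n / total:.0%})")
```

Whether a given share of missing values is a problem depends on what you plan to do with that column, so the output is a prompt for judgment rather than a pass/fail result.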
Assumptions about the distribution of the data
Beyond the contents and range of the data, we may have further assumptions about its distribution. In all honesty, I know a lot of statisticians who get excited about the heading of this section, but are disappointed with its contents. Statisticians love to check the appropriateness of distribution assumptions. Try Googling “normality test” or go straight to the Wikipedia page and you’ll see what I mean. It seems there are about a million ways to test whether your data are normally distributed, and that’s only one statistical distribution.
I’ll probably be banned from all future statistics conferences for writing this, but: I’m not usually that rigorous. Generally, plotting the data using a histogram or scatter plot can tell you whether the assumption you want to make is reasonable. For example, figure 1 is a graphic from one of my research papers in which I analyzed performances in track and field. Pictured is a histogram of the best men’s 400 m performances of all time (after taking their logarithms), overlaid with the curve of a normal distribution. The top performances fit the tail of a normal distribution, which was one of the key assumptions of my research, and I needed to justify that assumption. I didn’t use any of the statistical tests for normality, partly because I was dealing with the tail of the distribution (only the best performances, not all performances in history), but also because I intended to use the normal distribution unless it was obviously inappropriate for the data. To me, visually comparing the histogram with a plot of the normal distribution sufficed as verification of the assumption. The histogram was similar enough to the bell curve for my purposes.
Figure 1 The logarithms of the best men’s 400 m performances of all time seem to fit the tail of a normal distribution.
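The same kind of eyeball comparison can be approximated without a plotting library. The sketch below is an illustration, not the code behind figure 1: it fits a normal distribution to a whole sample (not only a tail, as in my track and field analysis) and prints the observed fraction in each histogram bin next to the fraction the fitted normal predicts. The function name `compare_to_normal`, the bin count, and the synthetic data are assumptions.

```python
import random
from statistics import NormalDist, fmean, stdev


def compare_to_normal(values, n_bins=8):
    """Return (bin_start, bin_end, observed_fraction, normal_fraction) per bin."""
    fitted = NormalDist(fmean(values), stdev(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    rows = []
    for i in range(n_bins):
        last = i == n_bins - 1
        start = lo + i * width
        end = hi if last else lo + (i + 1) * width
        # Bins are half-open, except the last one, which includes the maximum.
        observed = sum(1 for v in values if start <= v < end or (last and v == hi))
        expected = fitted.cdf(end) - fitted.cdf(start)
        rows.append((start, end, observed / len(values), expected))
    return rows


random.seed(0)
data = [random.gauss(50, 5) for _ in range(2000)]  # synthetic data, for illustration
for start, end, obs, exp in compare_to_normal(data):
    bar = "#" * round(obs * 100)  # crude text histogram
    print(f"{start:6.1f} to {end:6.1f}  observed {obs:.3f}  normal {exp:.3f}  {bar}")
```

If the two columns disagree badly, or the bars look nothing like a bell, that is the signal to reconsider the assumption or to reach for a formal test.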
Though I may have been less than statistically rigorous with the distribution of the track and field data, I don’t want to be dismissive of the value of checking the distributions of data. Bad things can happen if you assume you have normally distributed data when you don’t. Statistical models that assume normal distributions don’t handle outliers well, and the majority of popular statistical models make some sort of assumption of normality. This includes the most common kinds of linear regression as well as the t-test. Assuming normality when your data isn’t even close can make your results appear significant, when in fact they’re insignificant or wrong.
This last statement is valid for any statistical distribution, not only the normal. You may have categorical data that you think is uniformly distributed, when in fact some categories appear far more often than others. Social networking statistics, such as the kind I’ve calculated from the Enron data set (number of emails sent, number of people contacted in a day, etc.), are notoriously non-normal. They’re typically something like exponentially or geometrically distributed, both of which I’d also check against the data before assuming.
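For the categorical case, a quick check is to compare each category’s observed share with the share it would have under a uniform distribution. This is a minimal sketch with made-up folder labels; the function name `share_vs_uniform` is hypothetical, and categories that never appear in the data are not counted, which can hide part of the imbalance.

```python
from collections import Counter


def share_vs_uniform(labels):
    """Return {category: (observed_share, uniform_share)}, most common first."""
    counts = Counter(labels)
    uniform = 1 / len(counts)  # equal share across the categories actually seen
    n = len(labels)
    return {cat: (c / n, uniform) for cat, c in counts.most_common()}


# Made-up labels for illustration.
labels = ["inbox"] * 70 + ["sent"] * 20 + ["personal"] * 7 + ["legal"] * 3
for cat, (obs, uni) in share_vs_uniform(labels).items():
    print(f"{cat:10s} observed {obs:.2f}  uniform {uni:.2f}  ratio {obs / uni:.1f}x")
```

A ratio far from 1 for any category suggests the uniform assumption is not even roughly right.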
All in all, it might be OK to skip a statistical test for checking that your data fit a distribution, but be careful and make sure that your data match any assumed distribution, at least roughly. Skipping this step can be catastrophic for results.
A handy trick for uncovering your assumptions
If you feel like you don’t have assumptions, or you’re not sure what your assumptions are, or even if you think you know your assumptions, try this:
Describe your data and project to a friend—what’s in the data set and what you’re going to do with it—and write down your description. Then, dissect your description, looking for assumptions.
For example, I might describe my original project involving the Enron data as:
“My data set is a bunch of emails, and I’m going to establish organization-wide patterns of behavior over the network of people using techniques from social network analysis. I’d like to draw conclusions about things like employee responsiveness as well as communication “up” the hierarchy, i.e. with a boss.”
In dissecting this description, I first identify phrases and then I think about what assumptions might be lurking beneath them, as in:
- “My data set is a bunch of emails” – It’s probably true, but it might be worth checking to see if there might be other non-email data types, such as chat messages or call logs.
- “organization-wide” – What is the organization? Are we assuming it’s clearly defined, or are there fuzzy boundaries? It might help to run some descriptive statistics regarding the boundaries of the organization, possibly people with a certain email address domain, or people who wrote more than a certain number of messages.
- “patterns of behavior” – What assumptions do you have about what constitutes a “pattern of behavior”? Does everyone need to engage in the same behavior for it to be declared a “pattern”, or do we have a set of patterns that we compare with individual examples to find a match with those patterns?
- “network of people” – Does everyone in the network need to be connected? Can there be unconnected people? Are we planning to use a certain statistical model from the social network analysis literature? If so, does it require certain assumptions?
- “responsiveness” – What are we assuming that this term means? Can we define it statistically and verify that the data supports such a definition by using the basic definition along with some descriptive statistics?
- “hierarchy” – Are we assuming we have complete knowledge of the organization’s hierarchy? Do we assume that it’s rigid, or does it change?
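Some of the questions above can be answered with simple descriptive statistics. As one illustration for the “organization-wide” item, the sketch below counts distinct senders per email domain and lists people who wrote at least a given number of messages. The function names `domain_counts` and `active_senders`, the threshold of 3, and the addresses are all hypothetical.

```python
from collections import Counter


def domain_counts(addresses):
    """Count addresses per domain (the text after the last '@', lowercased)."""
    return Counter(a.rsplit("@", 1)[1].lower() for a in addresses if "@" in a)


def active_senders(senders, min_messages):
    """Return {sender: message_count} for senders with at least min_messages."""
    counts = Counter(senders)
    return {s: n for s, n in counts.items() if n >= min_messages}


# Made-up sender list, one entry per message sent.
senders = (
    ["ann@example.com"] * 5
    + ["raj@example.com"] * 3
    + ["newsletter@example.org"] * 4
    + ["li@example.com"]
)

by_domain = domain_counts(set(senders))  # distinct people per domain
core = active_senders(senders, min_messages=3)  # threshold is an arbitrary choice

print("People per domain:", dict(by_domain))
print("Senders with 3+ messages:", core)
```

Neither number defines the organization by itself, but together they show how sharp or fuzzy the boundary is under each candidate definition.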
Realizing when we’re making assumptions, by dissecting our project description and then asking such questions, can help avoid many problems later. I wouldn’t want to find out that a critical assumption was false only after I’d completed my analysis, found odd results, and then gone back to investigate. Even more, I wouldn’t want a critical assumption to be false and never notice it.
That’s all for this article!