By Brian Godsey
This article has been excerpted from Think Like a Data Scientist
Ask good questions—of the data
Data is useful for any number of things, but it can’t do anything on its own. That’s why the questions we ask of it are so important.
The two most dangerous pitfalls a data scientist can stumble into are:
- Expecting the data to be able to answer a question it can’t; and
- Asking questions of the data that don’t solve the original problem.
Asking questions that lead to informative answers and, subsequently, improved results is an important and nuanced challenge that deserves much more discussion than it typically receives. Good, or at least helpful, questions are specific in their phrasing and scope, even if they can apply to many types of projects. In the following sub-sections, I attempt to define and describe a “good question” with the intent of delivering a framework or thought process for generating good questions for an arbitrary project. Hopefully you, the reader, will see how to come up with thoughtful questions to ask of the data.
Good questions are concrete in their assumptions
No question is as tricky to answer as one based on faulty assumptions. A question based on unclear assumptions is a close second. Every question has assumptions, and if those assumptions don’t hold, it can spell disaster for your project. It’s important to think about the assumptions your questions require, and decide if these assumptions are safe. And, for you to figure out if the assumptions are safe, they need to be well-defined and able to be tested.
I briefly worked at a hedge fund. I was in the quantitative research department, and our principal goal was, as with any hedge fund, to find patterns in financial markets that might be exploited for monetary benefit. A key aspect of the trading algorithms I worked with was a method for model selection. Model selection is to mathematical modeling what trying on pants is to shopping: we try many of them, judge them, and then select one or a few, hoping they serve us well in the future.
Several months after I began working at this hedge fund, another mathematician was hired, fresh out of graduate school. She began working directly with the model selection aspect of the algorithms. One day, as we walked to lunch, she began to describe to me how several of the mathematical models of the commodities markets had begun to diverge widely from their long-term average success rates. For example, let’s assume that Model A has correctly predicted whether the daily price of crude oil would go up or down 55% of the time over the last three years. However, in the last four weeks, Model A has been correct only 32% of the time. My colleague informed me that, because the success rate of Model A had fallen below its long-term average, it was bound to pick back up in the coming weeks, and we should bet on the predictions of Model A.
Frankly, I was disappointed with my colleague, but hers was an easy mistake to make. When a certain quantity—in this case the success rate of Model A—typically returns to its long-term mean, it’s known as “mean reversion”, and is a famously contested assumption of many real-life systems, not the least of which are the world’s financial markets.
Innumerable systems in this world don’t subscribe to mean reversion. Flipping a standard coin is one of them. If you flip a coin 100 times and you see heads only 32 times, do you think you’re going to see more than 50 heads in the next 100 tosses? I certainly don’t, at least not to the point that I’d bet on it. The history of a fair coin being tossed doesn’t affect the future of the coin, and commodities markets are, in general, the same way. Granted, there are many funds that find exploitable patterns in financial markets, but these are the exceptions rather than the rule.
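The coin-toss claim can be checked with a few lines of Python. What follows is a minimal sketch using only the standard library: it computes the exact chance of more than 50 heads in 100 fair tosses, then simulates pairs of 100-toss runs and looks at the second run only when the first one came up short on heads. The function names, the threshold of 45 heads (a run of only 32 heads is too rare to simulate cheaply), and the random seed are illustrative assumptions, not part of the original text.

```python
import random
from math import comb

def prob_more_than(k, n=100, p=0.5):
    """Exact probability of more than k heads in n independent tosses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

def mean_heads_after_bad_run(trials=10000, n=100, threshold=45, seed=0):
    """Simulate pairs of n-toss runs and average the heads in the second run,
    keeping only pairs whose first run had at most `threshold` heads."""
    rng = random.Random(seed)
    followups = []
    for _ in range(trials):
        first = sum(rng.random() < 0.5 for _ in range(n))
        second = sum(rng.random() < 0.5 for _ in range(n))
        if first <= threshold:
            followups.append(second)
    return sum(followups) / len(followups)

p_over_50 = prob_more_than(50)
avg_after_bad = mean_heads_after_bad_run()
print(f"P(more than 50 heads in 100 tosses) = {p_over_50:.3f}")
print(f"Average heads in the run after a poor run = {avg_after_bad:.1f}")
```

The exact probability comes out at about 0.46, a little under one half because a result of exactly 50 heads does not count. The simulated average after a poor first run should sit close to 50, since the second run knows nothing about the first.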
The assumption of mean reversion is a great example of a fallacious assumption in a question that you might ask the data. In this case, my colleague was asking, “Will Model A’s success rate return to its long-term average?” and, based on the assumption of mean reversion, the answer is “yes”: mean reversion implies that Model A will be correct more often when it has recently been on a streak of incorrectness. If you don’t assume mean reversion in this case, the answer is “I have no idea”.
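One way to probe the assumption behind the question is to ask how surprising the recent streak would be if Model A's long-term rate still applied. The sketch below is a simple binomial tail calculation in Python, not a method described in the article. The 55% and 32% figures come from the example above; the count of 25 predictions in the four-week window is an assumption made purely for illustration, since the article does not give it, and the function name is hypothetical.

```python
from math import comb

def prob_at_most(k, n, p):
    """Probability of at most k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

LONG_TERM_RATE = 0.55     # Model A's long-term success rate, from the example
RECENT_PREDICTIONS = 25   # assumption: the article does not give this count
RECENT_CORRECT = 8        # 8 / 25 = 32%, matching the recent success rate

tail = prob_at_most(RECENT_CORRECT, RECENT_PREDICTIONS, LONG_TERM_RATE)
print(f"P(at most {RECENT_CORRECT} of {RECENT_PREDICTIONS} correct at a 55% rate) = {tail:.4f}")
```

A small tail probability would not say that the success rate is about to recover. If anything, it suggests that the 55% figure itself may no longer describe Model A, which is a reason to examine the assumption rather than bet on it.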
It’s extremely important to acknowledge your assumptions—there are always assumptions—and to make sure that they are true, or at least to make sure that your results won’t be ruined if the assumptions turn out to be false. This is easier said than done. One way to accomplish this is to break down the reasoning between your analysis and your conclusion into specific logical steps, and to make sure the gaps are filled in. In the case of my former colleague, the original steps of reasoning were:
1. The success rate of Model A has recently been relatively low.
2. Therefore, the success rate of Model A will be relatively high for the near future.
The data tell us (1), and then (2) is the conclusion we draw. If it isn’t obvious that a logical step is missing from this reasoning, it might be easier to see when we replace the success rate of Model A with an arbitrary quantity X that might go up or down over time:
1. X has gone down recently.
2. Therefore, X will go up soon.
Think of the things X could be: stock price; rainfall; grades in school; bank account balance. For how many of these does the above logic make sense? Is there a missing step? I’d argue that there is indeed a missing step. The logic should be:
1. X has gone down recently.
2. Because X always corrects itself towards a certain value, V,
3. X will go up soon, towards V.
Note that the data have told us (1), as before, and we’d like to be able to draw the conclusion in (3), but (3) is dependent on (2) being true. Is (2) true? Again, think of the things X could be. Certainly, (2) isn’t true for a bank account balance, or rainfall; it can’t always be true. We must ask ourselves if it’s true for the particular quantity we’re examining: do we have reason to believe that, for an arbitrary period, Model A should be correct in its prediction 55% of the time? In this case, the only evidence we have that Model A is correct 55% of the time is that Model A, historically, has been correct 55% of the time. This is something like circular reasoning, which isn’t enough real evidence to justify the assumption. Mean reversion shouldn’t be taken as truth, and the conclusion that Model A should be correct 55% of the time (or more) in the near future isn’t justified.
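How much the conclusion leans on step (2) can be seen in a small simulation. The sketch below, in Python, starts a quantity X some distance below a value V and estimates how often X ends up higher than where it started. With no pull towards V, X is a plain random walk; with a pull, X corrects itself towards V as step (2) claims. The function name, the pull strength of 0.5, the starting drop of 2.0, and the unit-variance noise are all illustrative assumptions and are not estimated from any real data.

```python
import random

def chance_of_rebound(pull, drop=2.0, v=0.0, steps=10, trials=5000, seed=1):
    """Estimate how often X ends higher than it started, given it starts `drop` below V.

    pull = 0.0: random walk, X has no tendency to return to V.
    pull > 0.0: each step moves X that fraction of the way back towards V.
    """
    rng = random.Random(seed)
    rebounds = 0
    for _ in range(trials):
        start = v - drop
        x = start
        for _ in range(steps):
            # Correction towards V (zero for a random walk) plus random noise.
            x += pull * (v - x) + rng.gauss(0.0, 1.0)
        rebounds += x > start
    return rebounds / trials

random_walk_rate = chance_of_rebound(pull=0.0)
mean_reverting_rate = chance_of_rebound(pull=0.5)
print(f"Rebound rate without mean reversion: {random_walk_rate:.2f}")
print(f"Rebound rate with mean reversion:    {mean_reverting_rate:.2f}")
```

Without the pull, a rebound is roughly a coin flip; with it, a rebound is very likely. The same observation, "X has gone down recently", supports the conclusion only in the second case, which is why step (2) has to be justified separately.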
As a mathematician, I’ve been trained to separate analysis, argument, and conclusion into logical steps, and this experience has proven invaluable in making and justifying real-life conclusions and predictions through data science. Formal reasoning is probably the skill I value most among those I learned through my mathematics coursework in college. To emphasize again the point of this section: a false or unclear assumption starts you out in a questionable place, and every effort should be made to avoid relying on such assumptions.
Good answers: measurable success without too much cost
Perhaps shifting focus to the answers to “good questions” can shed more light on what makes a question good, as well as help you decide if your answers are sufficient. The answer to a good question should measurably improve the project’s situation. You should be asking questions that, whatever the answer, make your job easier by moving you closer to a practical result.
How do we know if answering a question will move us closer to a useful, practical result? I return to the idea that one of a data scientist’s most valuable traits is awareness of what might occur combined with the ability to prepare for it. If you can imagine all (or at least most) possible outcomes, then you can follow the logical conclusions from them. If you know the logical conclusions (the additional knowledge that can be deduced from each outcome), then you can figure out if they’ll help you with the goals of your project.
There can be a wide range of possible outcomes, many of which can be helpful. Though this isn’t an exhaustive list, you can move closer to the goals of your project if you ask and answer questions that lead to:
- Positive OR negative results;
- Elimination of possible paths or conclusions; or
- Increasing situational awareness.
Both positive and negative results can be helpful. What I call “positive” results are those that confirm what you suspected and/or hoped for when you initially asked the question. These are helpful because they fit into your thought processes about the project, and also move you directly towards your goals. After all, goals are basically yet-unrealized positive results which, if confirmed, give some tangible benefit to your customer.
Negative results are helpful because they inform you that something you thought true is false. These results usually feel like setbacks, but, practically speaking, they’re the most informative results. What if you found out that the sun wasn’t going to rise tomorrow, despite historical evidence to the contrary? This is an extreme example, but can you imagine how informative it would be, if confirmed? It changes everything, and you’re likely one of the few people who know it, given that it’s counter-intuitive. In that way, negative results can be the most helpful, though often they require you to re-adjust your goals based on the new information. At the least, negative results force you to re-think your project to account for those results, a process that leads to more informed choices and a more realistic path for your project.
Data science is fraught with uncertainty. Many paths lead to a solution, many lead to failure, and even more lead to the gray area between success and failure. Evidence of improbability or outright elimination of any of these possible paths or conclusions can be helpful to inform and focus the next steps of the project. A path can be eliminated or deemed improbable in many ways, which might include:
- New information making a path far less likely;
- New information making other paths far more likely;
- Technical challenges that make exploring certain paths extremely difficult or impossible.
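The first two items in this list are two sides of the same update, and one way to make that concrete is Bayes' rule. The sketch below, in Python, is only an illustration of that idea: the path names, the prior beliefs, and the likelihoods are all hypothetical numbers chosen for the example, and the article does not prescribe this method.

```python
def update_path_beliefs(priors, likelihoods):
    """Bayes' rule over competing paths: posterior is proportional to prior * likelihood."""
    unnormalized = {path: priors[path] * likelihoods[path] for path in priors}
    total = sum(unnormalized.values())
    return {path: weight / total for path, weight in unnormalized.items()}

# Hypothetical beliefs about which path will succeed, before the new information.
priors = {"path_a": 0.5, "path_b": 0.3, "path_c": 0.2}
# Hypothetical: how probable the new observation would be if each path were viable.
likelihoods = {"path_a": 0.05, "path_b": 0.6, "path_c": 0.6}

posteriors = update_path_beliefs(priors, likelihoods)
for path, belief in posteriors.items():
    print(f"{path}: {priors[path]:.2f} -> {belief:.2f}")
```

With these made-up numbers, evidence that counts against path_a drops it from the favorite to a long shot, and the other two paths become more likely without any direct evidence in their favor.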
If eliminating a path doesn’t seem like it’s helping (maybe it was one of the only paths that might have succeeded), keep in mind that your situation has become simpler regardless, which can be good. You also have the chance to re-think your set of paths and your knowledge of the project. Maybe there’s more data, more resources, or something else you haven’t thought of that might help you gain a new perspective on the challenges.
In data science, increasing situational awareness is always good. What you don’t know can hurt you, because an unknown quantity can sneak into some aspect of your project and ruin the results. A question can be good if it helps you gain insight into how a system works or what peripheral events are occurring that affect the data set. If you find yourself saying, “I wonder if…” at some point, or if a colleague does the same, ask yourself if that thought relates to a question that can help you gain some context for the project. If not, look for an answer to some larger, more direct question. Being introspective in this way brings some formality and procedure to the often-fuzzy task of looking for “good” results.