From Think Like a Data Scientist by Brian Godsey

Every project in data science has a customer. Sometimes the customer is someone who pays you or your business to do the project – for example, a client or contracting agency. In academia, the customer might be a laboratory scientist who has asked you to analyze their data. Sometimes the customer is you, your boss, or another colleague; in every case, it’s important to set good goals.


No matter who the customer might be, they have some expectations about what they might receive from you, the data scientist, who has been given the project.

Often, these expectations relate to:

  1. questions that need to be answered or problems that need to be solved;
  2. a tangible final product, such as a report or software application; or,
  3. summaries of prior research or related projects and products.

Expectations can come from just about anywhere. Some are hopes and dreams, and others are drawn from experience or knowledge of similar projects. However, a typical discussion of expectations boils down to two sides: what the customer wants vs. what the data scientist thinks is possible. This could be described as wishes vs. pragmatism, with the customer describing their desires, and the data scientist approving, rejecting, or qualifying each one based on apparent feasibility. On the other hand, if you’d like to think of yourself, the data scientist, as a genie, a Granter of Wishes, you wouldn’t be the first to do so!


Resolving wishes and pragmatism

A customer’s wishes can range from completely reasonable to utterly outlandish, and this is OK. Much of business development and hard science is driven by intuition. That is to say, CEOs, biologists, marketers, and physicists alike use their experience and knowledge to develop theories about how the world works. Some of these theories are backed by solid data and analysis, but others come more from intuition, which is basically a conceptual framework that the person has developed while working extensively in their field. A notable difference between many fields and data science is that in data science, if a customer has a wish, even an experienced data scientist may not know whether it is possible. A software engineer usually knows what tasks software tools are capable of performing, and a biologist knows more or less what the laboratory can do, but a data scientist who has not yet seen or worked with the relevant data faces a large amount of uncertainty, principally about what specific data is available and about how much evidence it can provide to answer any given question. Uncertainty is, again, a major factor in the data scientific process, and should be kept at the forefront of your mind when talking with customers about their wishes.

For example, during the few years that I worked with biologists and gene expression data, I began to develop my own conceptual ideas about how RNA is transcribed from DNA, and how strands of RNA float around in a cell and interact with other molecules. I am a visual person, so I often found myself picturing a strand of RNA comprising hundreds or maybe thousands of nucleotides, each one appearing as one of four letters representing a base compound (A, C, G, or T; I’ll use “T” in place of “U” for convenience), and the whole strand looking like a long, flexible chain – a sentence that makes sense only to the machinery within the cell. Because of the chemistry of RNA and its nucleotides, complementary sequences like to bind to one another; A likes to bind to T, and C likes to bind to G. So, when two strands of RNA contain near-complementary sequences, they may very well stick to one another. A single strand of RNA might also fold in upon and stick to itself if it is flexible enough and contains mutually complementary sequences. This is a conceptual framework that I’ve used on many occasions to make guesses about the types of things that can happen when a bunch of RNA is floating around in a cell.
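
To make this picture concrete, here is a minimal Python sketch of complementarity, entirely my own illustration rather than anything from the original study: it computes the reverse complement of a short sequence and checks whether one strand contains an exact complementary site in another. Real binding prediction is far less strict than an exact match.

    # Toy illustration of RNA complementarity, using T in place of U as in
    # the text. Real binding prediction models thermodynamics, mismatches,
    # and partial complementarity; this checks only exact matches.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(seq: str) -> str:
        """Return the reverse complement of a nucleotide sequence."""
        return "".join(COMPLEMENT[base] for base in reversed(seq))

    def has_binding_site(target_mrna: str, microrna: str) -> bool:
        """True if the mRNA contains an exact site complementary to the microRNA."""
        return reverse_complement(microrna) in target_mrna

    mirna = "ACGGTTACGGTTACGGTTAC"                    # made-up 20-nucleotide microRNA
    mrna = "GGG" + reverse_complement(mirna) + "CCC"  # toy mRNA containing one such site
    print(has_binding_site(mrna, mirna))              # True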

Thus, when I began to work with microRNA data, it made sense to me that microRNA—short sequences of about 20 nucleotides—might bind to a section of a genetic mRNA sequence (i.e., RNA transcribed directly from a strand of DNA corresponding to a specific gene, which is typically much longer) and inhibit other molecules from interacting with the gene’s mRNA, effectively rendering that gene sequence useless. It made conceptual sense to me that one bit of RNA can stick to a section of genetic RNA and simply end up blocking another molecule from sticking to the same place. This concept is supported by scientific journal articles and hard data showing that microRNA can inhibit expression or function of genetic mRNA if they have complementary sequences.

However, a professor of biology I was working with had a much more nuanced conceptual framework describing how he saw this system of genes, microRNA, and mRNA. In particular, he had been working with the biology of Mus musculus—a common mouse—for decades, and could list any number of notable genes, their functions, related genes, and physical systems and characteristics that are measurably affected if one begins to do experiments that “knock out” said genes. Because the professor knew more than I will ever know about the genetics of mice, and because it would be impossible for him to share all of his knowledge with me, it was incredibly important for us to talk through the goals and expectations of a project prior to spending too much time working on any aspect of it. Without his input, I would be, essentially, guessing at what the biologically relevant goals were. If I guessed wrong, which was likely, work would be wasted. For example, certain specific microRNAs have been very well studied, and are known to accomplish very basic functions within a cell, and little more. If one of the goals of the project was to discover new functions of little-studied microRNAs, we would probably want to exclude certain families of microRNAs from the analysis. If we didn’t exclude them, they would most likely just add to the noise of an already very noisy genetic conversation within a cell. This is merely one of a large number of important things that the professor knew, but I did not. Thus, a lengthy discussion of goals, expectations, and caveats is necessary before starting any project in earnest.

In a broad (if simple) sense, a project can be deemed successful if and only if the customer is satisfied with the results. Of course, there are exceptions to this guideline, but regardless, it is important always to have the expectations and goals in mind during every step of a data science project. Unfortunately, in my own experience, expectations are not usually clear or obvious at the very beginning of a project, or they are not easy to formulate concisely. So, I’ve settled on a few practices that help me figure out reasonable goals that can guide me through each step of a project involving data science.


The customer is probably not a data scientist

A funny thing about customer expectations is that they may not be appropriate. It’s not always—or even usually—the customer’s fault, because the problems that data science addresses are inherently complex, and if the customer understood their own problem fully, they likely would not need a data scientist to help them. So, I always cut customers some slack when they are unclear in their language or understanding, and I view the process of setting expectations and goals as a joint exercise that could be said to resemble conflict resolution or relationship therapy.

You, the data scientist, and the customer share a mutual interest in completing the project successfully, but the two of you likely have different specific motivations, different skills, and most importantly, different perspectives. Even if you, yourself, are the customer, you can think of yourself as having two halves, one (the data scientist) that is focused on getting results and another (the customer) that is focused on using those results to do something “real”, or external to the project itself. In this way, a project in data science begins by finding agreement between two personalities, two perspectives that, if they aren’t conflicting, are at the very least disparate.

While there is not, strictly speaking, a conflict between you and the customer, sometimes it can seem that way as you both muddle your way towards some semblance of a set of goals that are both achievable (for the data scientist) and helpful (for the customer). And, just like in conflict resolution and relationship therapy, there are feelings involved. These feelings can be ideological and driven by personal experience, preference, or opinion, and may not make sense to the other party. Thus, a little patience and understanding, without too much judgment, can be extremely beneficial to both of you, and, more importantly, to the project.


Asking very specific questions to uncover facts, not opinions

When a customer is describing a theory or hypothesis about the system that you are to investigate, they will almost certainly express a mixture of fact and opinion, and it can often be important to distinguish between the two. For example, in a study of cancer development in mice, the aforementioned biology professor told me, “It is well-known which genes are cancer-related, and this study is concerned with only those genes, and the microRNAs that inhibit them.” One might be tempted to take this statement at face value and analyze data from only the cancer-related genes, but this could be a mistake, because there is some ambiguity in the statement. Principally, it is not clear whether other supposedly non-cancer-related genes can be involved in auxiliary roles within the complex reactions incited by the experiments, or if it is well-known and proven that the expression of cancer-related genes is entirely independent of other genes. In the case of the former, it would not be a good idea to ignore the data corresponding to non-cancer-related genes, whereas in the case of the latter, it might be a good idea. Without resolving this issue, it is not clear which is the appropriate choice. Therefore, it is important to ask.

It is also important that the question itself be formulated in a way that the customer understands. It would not be wise to ask, for example, “Should I just ignore the data from the non-cancer-related genes?” This is a question about the practice of data science in this specific case, and falls under your domain, not the biologist’s. You should ask, rather, something similar to, “Do you have any evidence that the expression of cancer-related genes is independent, in general, of other genes?” This is a question about biology, and hopefully the biology professor would understand it.

In his answer, it is important to distinguish between what he thinks and what he knows. If the professor merely thinks that the expression of these genes is independent of others, then it is certainly something to keep in mind throughout the project, but you should not make any very important decisions—such as ignoring certain data—based on it. If, on the other hand, the professor can cite scientific research supporting his claim, then it is absolutely advisable to use this fact to make decisions.

In any project, you, the data scientist, are an expert in statistics and in software tools, but the principal subject matter expert is very often someone else, as in the case involving the professor of biology. In learning from this subject matter expert, you should ask questions that not only give you some intuitive sense of how the system under investigation works, but also attempt to separate fact from opinion and intuition. Basing practical decisions on fact is always a good idea, but basing them on opinion can be dangerous. The maxim, “Trust, but confirm,” is appropriate here. If I had ignored any of the genes in the data set, I might very well have missed a crucial aspect of the complex interaction taking place between various types of RNA in the cancer experiments. Cancer, it turns out, is a very complex disease on the genetic level as well as on the medical one.


Suggesting deliverables: guess and check

Your customer probably doesn’t understand data science and what it can do. Asking them, “What would you like to appear in the final report?” or “What should this analytic application do?” can easily result in, “I don’t know,” or, even worse, a suggestion that doesn’t really make sense. Data science is not their area of expertise, and they are probably not fully aware of the possibilities and limitations of software and data. So, it is usually best to approach the question of the final product with a series of suggestions, and then to note the customer’s reaction.

One of my favorite questions to ask a customer is, “Can you give me an example of a sentence that you might like to see in a final report?” I might get responses such as, “I’d like to see something like: ‘MicroRNA-X seems to inhibit Gene Y significantly’;” or “Gene Y and Gene Z seem to be expressed at the same levels in all samples tested.” Answers like these give a great starting point for conceiving the format of the final product. If the customer can give you seed ideas like these, you can expand upon them to make suggestions of final products. You might then ask, “What if I gave you a table of the strongest interactions between specific microRNAs and genetic mRNAs?” Maybe the customer would say that this would be valuable, or maybe not.
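
To illustrate what such a table might look like, here is a hedged sketch in Python, with made-up sequence names and random numbers standing in for real expression measurements: it ranks microRNA-mRNA pairs by Pearson correlation across samples.

    # Hypothetical sketch of a deliverable: rank microRNA-mRNA pairs by
    # expression correlation across samples. Names and data are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples = 12
    mirna_expr = {f"miR-{i}": rng.normal(size=n_samples) for i in range(1, 6)}
    mrna_expr = {f"GeneY-{j}": rng.normal(size=n_samples) for j in range(1, 6)}

    rows = []
    for mirna, x in mirna_expr.items():
        for gene, y in mrna_expr.items():
            r = np.corrcoef(x, y)[0, 1]  # Pearson correlation across samples
            rows.append((mirna, gene, r))

    # Most-negative first: inhibition should show up as negative correlation.
    rows.sort(key=lambda row: row[2])
    for mirna, gene, r in rows[:5]:
        print(f"{mirna}\t{gene}\t{r:+.2f}")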

It is most likely, however, that a customer will make less clear statements, such as, “I’d like to know which microRNAs are important in cancer development.” A statement like this, of course, needs clarification if we hope to complete the project successfully. What does “important” mean in a biological sense? How might this importance manifest itself in the available data? It is important to get answers to these questions before proceeding; if you don’t know how microRNA importance might manifest itself in the data, how will you know when you’ve found it?

One mistake that I and many others have made is to conflate correlation with significance. Some people talk about the confusion of correlation and causation, an example of which is: a higher percentage of helmet-wearing cyclists are involved in accidents than non-helmet-wearing cyclists; it might be tempting to conclude that helmets cause accidents, but this is probably fallacious. The correlation between helmets and accidents does not imply that helmets cause accidents, nor does it imply that accidents directly cause helmet-wearing. In reality, cyclists who ride on busier and more dangerous roads are more likely to wear helmets and also more likely to get into accidents. Essentially, the act of riding on more dangerous roads causes both. In the question of helmets and accidents, there is no direct causation despite the existence of correlation.

Causation, in turn, is merely one example of a way that correlation might be significant. If you are performing a study on the use of helmets and the rates of accidents, then this correlation might be significant even if it does not imply causation. It should be stressed that significance, as I use the term, is determined by the project’s goals. This knowledge of a helmet-accident correlation could lead to considering (and modeling) the level of traffic and danger on each road as part of the project. Significance, also, is not guaranteed by correlation. I am fairly certain that more cycling accidents happen on sunny days, but this is because more cyclists are on the road on sunny days, and not because of any other significant relationship (barring rain, of course). It is not immediately clear to me how I might use this information to further my goals, so I wouldn’t spend much time exploring it. The correlation simply does not seem to have any significance in this particular case.
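
A small simulation makes the point; the numbers below are invented purely for illustration. Road danger drives both helmet use and accident risk, so helmet wearers show a higher accident rate even though the model contains no causal link from helmets to accidents.

    # Invented numbers: road danger drives both helmet use and accidents,
    # producing a helmet-accident correlation with no direct causation.
    import random

    random.seed(0)
    helmets, accidents = [], []
    for _ in range(10_000):
        danger = random.random()                              # how dangerous this ride is
        helmets.append(random.random() < 0.2 + 0.6 * danger)  # helmets likelier on dangerous roads
        accidents.append(random.random() < 0.01 + 0.08 * danger)

    def accident_rate(wearing_helmet):
        rides = [a for h, a in zip(helmets, accidents) if h == wearing_helmet]
        return sum(rides) / len(rides)

    print(f"accident rate with helmet:    {accident_rate(True):.3f}")
    print(f"accident rate without helmet: {accident_rate(False):.3f}")
    # The helmeted group has more accidents, yet nothing in the model
    # makes helmets cause accidents; danger causes both.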

In gene/RNA expression experiments, there are often thousands of RNA sequences measured within only 10-20 biological samples. Such an analysis, with far more variables (expression levels for each RNA sequence) than data points (samples), is called “high-dimensional” or “under-determined.” With so many variables, some of them will appear correlated simply by random chance, and it would be fallacious to claim that those are truly related in a real biological sense. If you present a list of strong correlations to the biology professor, he will spot immediately that some of your reported correlations are unimportant or, worse, contrary to established research, and you’ll have to go back and do more analyses.
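
To see how easily chance correlations arise at this scale, consider a quick simulation (my own toy numbers, not real expression data): thousands of pure-noise variables measured across a dozen samples still produce many correlations that look strong.

    # Toy numbers: thousands of pure-noise "expression levels" over only
    # 12 samples still yield many correlations that look strong.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_seqs = 12, 2000
    expr = rng.normal(size=(n_seqs, n_samples))  # no real signal anywhere

    target = expr[0]                             # correlate everything against one sequence
    corrs = np.array([np.corrcoef(target, row)[0, 1] for row in expr[1:]])
    strong = int((np.abs(corrs) > 0.7).sum())
    print(f"{strong} of {len(corrs)} noise sequences have |r| > 0.7 with the target")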


Iterate your ideas based on knowledge, not wishes

Just as it is important, within your acquired domain knowledge, to separate fact from opinion, it is important to avoid letting excessive optimism blind you to obstacles and difficulties. I’ve long said that an invaluable skill of good data scientists is the ability to foresee potential difficulties and to leave open a path around them.

It is popular in the software industry today to make claims about analytic capabilities while they are still under development. This, I’ve learned, is a tactic of salesmanship that often seems necessary, particularly for young start-ups, to get ahead in a very competitive industry. It always makes me nervous when someone is actively selling a piece of analytic software that I’ve said I think I can build, but which I’m not 100% sure will work as planned, given the limits of the data we have available. So, when I make such bold statements, I try to keep them, as much as possible, in the realm of things I’m almost certain I can do, and in case I can’t, I try to have a back-up plan that doesn’t involve the trickiest parts of the original plan.

Imagine you want to develop an application that summarizes news articles. You would need to create an algorithm that can parse the sentences and paragraphs in an article and extract the main ideas. It is possible to write an algorithm that does this, but it is not clear how well it will perform. Summaries may be successful in some sense for a majority of articles, but there is a big difference between 51% successful and 99% successful, and you won’t know where your particular algorithm falls within that range until you’ve built a first version, at least. Blindly selling and feverishly developing this algorithm might seem like the best idea; hard work will pay off, right? Maybe. This task is hard. It is entirely possible that, try as you might, you never get better than 75% success, and maybe that’s not good enough from a business perspective. What do you do then? Do you give up and close up shop? Do you, only after this failure, begin looking for alternatives?
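
For a feel of why the range is so wide, here is a deliberately naive extractive summarizer, a baseline sketch of my own rather than any method from the book: it scores sentences by word frequency and keeps the top scorers. On some articles this reads fine; on others it misses the point entirely.

    # Naive extractive summarizer: score each sentence by the average
    # corpus frequency of its words and keep the top scorers. A baseline
    # sketch only; production summarizers are far more involved.
    import re
    from collections import Counter

    def summarize(text, n_sentences=2):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))
        def score(sentence):
            tokens = re.findall(r"[a-z']+", sentence.lower())
            return sum(freq[t] for t in tokens) / max(len(tokens), 1)
        top = sorted(sentences, key=score, reverse=True)[:n_sentences]
        return " ".join(s for s in sentences if s in top)  # keep original order

    article = ("The city council approved the new budget on Tuesday. "
               "The budget increases funding for public transit. "
               "Opponents argued the transit funding comes at the cost of road repairs. "
               "A final vote is expected next month.")
    print(summarize(article))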

A good data scientist knows when a task is hard even before they begin. Sentences and paragraphs are complicated, random variables that often seem designed specifically to thwart any algorithm you might throw at them. In case of failure, I always go back to “first principles”, in a sense: What problem am I trying to solve? What is the end goal, beyond summarization?

If the end goal is to build a product that makes reading news more efficient, maybe there is another way to address the problem of inefficient news reading. Perhaps it is easier to aggregate similar articles and present them to the reader together. Maybe it is possible to design a better news reader through friendlier design or by incorporating social media.

No one ever wants to declare failure, but data science is a risky business, and to pretend that failure never happens is a failure in itself. There are always multiple ways to address a problem, and formulating a plan that acknowledges the likelihood of obstacles and failure can allow you to gain value from minor successes along the way, even if the main goal is not achieved.

A far greater mistake would be to ignore the possibility of failure and, with it, the need to test and evaluate the performance of the application. If you assume that the product works near-perfectly, but it doesn’t, delivering it to your customer can be a huge mistake. Can you imagine if you began selling an untested application that supposedly summarizes news articles, and soon thereafter your users began to complain that the summaries are completely wrong? Not only would the application be a failure, but you and your company might gain a reputation for software that doesn’t work.
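
One simple guard, sketched below with made-up pass/fail labels, is to estimate the success rate on human-judged examples before shipping, along with a rough error bar; with only a handful of judgments you cannot tell 51% from 99%.

    # Hypothetical pre-release check with made-up human judgments
    # (1 = acceptable summary). The error bar shows how little ten
    # judgments actually tell you.
    import math

    judgments = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    n = len(judgments)
    p = sum(judgments) / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)  # normal-approximation 95% interval
    print(f"success rate: {p:.0%} +/- {margin:.0%} on {n} judged articles")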

