An excerpt from Causal Inference for Data Science by Aleix Ruiz de Villa

This article explains:

·   Why and when we need causal inference

·   How causal inference works

·   How the book approaches the topic


Take 25% off Causal Inference for Data Science by entering fccruizdevilla into the discount code box at checkout at manning.com.


In most of the machine learning applications you find in commercial enterprises (and external research), the objective is to make predictions. So, you create a predictive model that, with some accuracy, will make a guess about the future. For instance, a hospital may be interested in predicting which patients are going to be severely ill, so that they can prioritize their treatment. In most predictive models, the mere prediction will do; you don’t need to know why it is the way it is.

Causal inference works the other way around. You want to understand why, and moreover you wonder what could be done to get a different outcome. A hospital, for instance, may be interested in the factors that affect some illness. Knowing these factors will help them to create public healthcare policies or drugs to prevent people from getting ill. The hospital wants to change how things currently are, in order to reduce the number of people ending up in the hospital.

Why should anyone who analyzes data be interested in causality? Most of the analysis we, as data scientists or data analysts, are interested in relates in some way or another to questions of a causal nature. Intuitively, we say that X causes Y when Y changes if you change X. So, for instance, if you want to understand your customer retention, you may be interested in knowing what you could do so that your customers use your services longer. What could be done differently in order to improve your customers’ experience? This is in essence a causal question: you want to understand what is causing your current customer retention stats, so that you can then find ways to improve them. In the same way, we can think of causal questions in creating marketing campaigns, setting prices, developing novel app features, making organizational changes, implementing new policies, developing new drugs, and on and on. Causality is about knowing what the impact of your decisions is, and what factors affect your outcome of interest.

Ask Yourself

Which types of questions are you interested in when you analyze data? Which of those are related in some way to causality? Hint: remember that many causal questions can be framed as measuring the impact of some decision or finding which factors (especially actionable ones) affect your variables of interest.

The problem is that knowing the cause of something is not as easy as it may seem. Let me explain.

Imagine you want to understand the causes of some illness, and when you analyze the data, you realize that people in the country tend to be sicker than people living in cities. Does this mean that living in the country is a cause of sickness? If that were the case, it would mean that if you move from the country to a city, you would have less of a chance of falling ill. Is that really true? Living in the city, per se, may not be healthier than living in the country, since you are exposed to higher levels of pollution, food is not as fresh or healthy, and life is more stressful. But it’s possible that people in cities generally have higher socio-economic status: they can pay for better healthcare, or they can afford gym memberships and do more exercise to prevent sickness. So, the fact that cities appear to be healthier could be due to socio-economic reasons and not because of the location itself. If this second hypothesis were the case, then moving from the country to a city would not, on average, improve your health; it might even increase your chances of being ill: you still wouldn’t be able to afford good healthcare, and you’d be facing new health threats from the urban environment.

The city-country example shows us a problem we will face often in causal inference. Living in the city and having a lower chance of falling ill frequently happen at the same time. However, we have also seen that where you live may not be the real cause of your health. That’s why the phrase “correlation is not causation” is so popular: the fact that two things happen at the same time does not mean that one causes the other. There may be other factors, like socio-economic status in our example, that are more relevant for explaining why.

That’s why we need to learn about causal inference: it gives us tools to estimate causal effects. That is, it gives us ways to discern correlation from causation, so that we can tell which relevant factors are causing an event of interest and which are not.

How Causal Inference works

Let’s continue with our example of finding out the causes of a particular illness. Imagine that you have a dataset with information about your patients (such as demographics, number of visits to the hospital, and the like) and what treatments they have received. What steps would you need to follow to analyze the causes of the illness? Let’s use this example to see how causal inference works, in five steps as summarized below.

Five steps describing the typical process in causal inference:

1. Determine whether you have experimental or observational data.
2. Understand your problem (context, variables, ...) as thoroughly as possible.
3. Create a model that describes how variables are related to each other.
4. Share your model, goal and assumptions with the other people involved in the analysis, to get their input.
5. Apply causal inference techniques to your data and your model to answer your causal questions:
   a. State which objectives can be answered in your model.
   b. Check your model assumptions against your data.
   c. Estimate causal effects using the appropriate formulas.

1. Determine the type of data

The first thing you need to know is how your data was created. In some cases, before gathering the data, we can design and run an experiment, in order to control the environment so that we can safely attribute the impact of a cause. In these situations, we are dealing with what is called experimental data. Unfortunately, it is not always possible to run experiments. Imagine that you want to understand the effects of smoking on teenagers. If you suspect that smoking causes cancer, you cannot run an experiment with teenagers, where you decide who smokes and who does not, because it would not be ethical. Another situation is when you analyze historical data for which you weren’t able to run a controlled experiment. When we don’t have experimental data, which turns out to be most of the time, we say that we have observational data. There is a huge difference between these two types of data: generally speaking, we will be much more confident about the results obtained from experimental data than those obtained from observational data.

Experimental data is always the preferred option. Unfortunately, running experiments is not always possible. That is when causal inference techniques come in. Simply put, if you have causal questions and you don’t have access to experimental data, then you need causal inference. In our example, there has not been any type of experiment design; we have just stored the data as we were receiving it.
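
To see why experiments are so powerful, here is a minimal simulation (all numbers invented): because treatment is assigned by coin flip, it is independent of every patient trait, and a simple difference in group means recovers the true effect.

```python
import random

random.seed(1)

# Hypothetical sketch: in an experiment, treatment is assigned at random,
# so it is independent of traits such as socio-economic status (ses).
# The true effect of the treatment is to lower illness risk by 0.10.
rows = []
for _ in range(100_000):
    ses = random.random() < 0.5
    treated = random.random() < 0.5          # randomized: ignores ses entirely
    base = 0.1 if ses else 0.3
    ill = random.random() < (base - 0.1 if treated else base)
    rows.append((treated, ill))

def illness_rate(treated_value):
    group = [ill for treated, ill in rows if treated == treated_value]
    return sum(group) / len(group)

effect = illness_rate(True) - illness_rate(False)
print(f"estimated effect: {effect:.2f}")     # close to the true -0.10
```

With observational data no such coin flip exists, and this direct comparison of group means is no longer trustworthy; that is the gap causal inference techniques try to close.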

2. Understand your problem

Moving to the second step, if you want to understand what makes people ill, you need to gather all the potential causes of the illness. Besides basic demographics such as age, gender, and location, you also need their medical history. “The more information you have, the better” may sound to you like a Big Data mantra, but this is different. While in machine learning you can create accurate predictive models without having all the variables, in causal inference, missing a relevant variable can be crucial. For example, for some illnesses, having comorbidities (illnesses other than the one we are interested in) may be very relevant. Imagine that, for whatever reason, you don’t have access to patients’ comorbidities. This means that you will not be able to determine which comorbidities affect your illness. In contrast, you may still be able to create a successful predictive machine learning algorithm that tells you whether someone is going to be ill or not: comorbidities make patients visit the hospital more often. So, even though you don’t have patients’ comorbidities, you may have highly correlated information, namely the frequency of patients’ visits, which may be enough to predict patients’ likelihood of falling ill.
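
A toy simulation of this last point (all names and numbers invented): a proxy that merely correlates with the missing cause can still predict well, while telling us nothing about causes.

```python
import random

random.seed(3)

# Hypothetical sketch: comorbidities are NOT recorded in our dataset,
# but they drive both hospital visits and illness, so the number of
# visits works as a predictive proxy.
data = []
for _ in range(50_000):
    comorbidity = random.random() < 0.3                        # unobserved
    visits = random.randint(3, 10) if comorbidity else random.randint(0, 4)
    ill = random.random() < (0.8 if comorbidity else 0.05)
    data.append((visits, ill))

# A crude predictor that never sees comorbidities: "many visits -> ill".
accuracy = sum((visits >= 5) == ill for visits, ill in data) / len(data)
print(f"accuracy from the proxy alone: {accuracy:.2f}")
# Decent prediction, yet this model says nothing about which
# comorbidities cause the illness or what intervening on them would do.
```
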

3. Create a model

Third, now that you have all the relevant variables, you create a causal model. This model should describe which variables cause what, or equivalently, describe how the data was generated. This model, as in physics, engineering or any other scientific discipline, is sometimes good and sometimes not so good. If the model is too simple, it will not explain the data well, while complex models are more sensitive to human errors. So, there is a trade-off between these two problems when determining the level of complexity of your model. The way to decide if a model is an accurate enough approximation of reality is simply whether it ends up being useful for its intended purpose or not, and this, in turn, depends heavily on what you want to achieve. For instance, imagine that you want to develop a machine learning model that is used in a mobile app to detect objects within any picture that you take with your camera. Since you are using it for your personal use, if the model fails from time to time, it is probably something that you can live with. However, if you are using a machine learning model in a self-driving car to detect objects on the road, a mistake can turn into an accident. In this case, a model with an accuracy less than 99.99999% is not reliable, i.e. not useful.

In causal inference, one way to create this model is using graphs that describe variables as nodes and causal effects between variables as directed edges (called arrows). Another way to create models is by using only equations.
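
For instance, the city-country example could be encoded as a small graph, here sketched as a plain Python mapping (the variable names are hypothetical, chosen just to match the illness example):

```python
# A minimal sketch of a causal graph: each key is a variable, and each
# value is the set of variables it directly causes (its outgoing arrows).
causes = {
    "socio_economic_status": {"location", "exercise", "illness"},
    "location": {"pollution_exposure"},
    "pollution_exposure": {"illness"},
    "exercise": {"illness"},
    "illness": set(),
}

def parents(graph, node):
    """Direct causes of `node`: every variable with an arrow into it."""
    return {v for v, children in graph.items() if node in children}

print(sorted(parents(causes, "illness")))
# ['exercise', 'pollution_exposure', 'socio_economic_status']
```

In practice you would typically use a graph library rather than a raw dictionary, but the idea is the same: the graph makes your causal assumptions explicit and inspectable.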

4. Share your model

You arrive at step four with the model you have created, based on some assumptions and with a goal in mind (to estimate some causal effect). Now make these assumptions and goals explicit and seek consensus with experts and other people involved in the analysis. You may have missed relevant variables or misinterpreted how variables were related. It is a good practice to communicate with others to make sure you are articulating the appropriate questions, agreeing on the overall goals and identifying the possible variables and how these variables relate to each other.

5. Apply causal inference techniques

Finally, at the fifth step, it is time to apply causal inference techniques to your dataset (notice that it has not been used so far) and your model to answer your causal questions. As we’ve said, correlation is not causation, so the fact that two variables are correlated doesn’t mean that one causes the other. Usually, when two variables are correlated but one is not the only cause of the other, it is due to the existence of a third factor that causes both. Variables playing this role of common cause are called confounders. Informally speaking, in causal inference, the presence of confounders is the root of all evil. Fortunately, causal inference has a set of formulas, algorithms and methodologies that let you deal with them, through the following steps:

  1. Ask yourself: what can we answer with the information we have? State which of your causal questions can be answered using your model and your data. Sometimes, the lack of information about some confounders becomes a problem; identify the cases in which it can be overcome, and for the rest analyze alternatives, such as gathering new data or finding surrogate variables.
  2. Check whether the assumptions you made in creating your model hold. Fortunately, as we will see, some of these assumptions can be checked against your data.
  3. Discern correlation from causation using your own data in order to estimate causal effects. This is done using a specific set of formulas. Most of this book is devoted to explaining how, when, and why to employ these formulas, how to select the appropriate formula for different kinds of problems, and how to apply them efficiently using statistical and machine learning techniques.
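
The third step can be sketched on simulated data. Here socio-economic status (ses) plays the confounder role and location has no true effect; the variables and probabilities are invented for illustration. The adjustment (back-door) formula, P(ill | do(city)) = Σz P(ill | city, ses=z) · P(ses=z), removes the spurious association that the naive comparison picks up.

```python
import random

random.seed(2)

# Hypothetical simulation: ses confounds location and illness,
# and location has NO causal effect on illness.
rows = []
for _ in range(200_000):
    ses = random.random() < 0.5
    city = random.random() < (0.8 if ses else 0.2)
    ill = random.random() < (0.1 if ses else 0.3)    # depends on ses only
    rows.append((ses, city, ill))

def p_ill_given(city_value, ses_value):
    group = [ill for ses, city, ill in rows
             if city == city_value and ses == ses_value]
    return sum(group) / len(group)

def p_ill_do(city_value):
    # Adjustment formula: average P(ill | city, ses) over P(ses).
    p_high_ses = sum(ses for ses, _, _ in rows) / len(rows)
    return (p_ill_given(city_value, True) * p_high_ses
            + p_ill_given(city_value, False) * (1 - p_high_ses))

city_rows = [ill for _, city, ill in rows if city]
country_rows = [ill for _, city, ill in rows if not city]
naive = sum(city_rows) / len(city_rows) - sum(country_rows) / len(country_rows)
adjusted = p_ill_do(True) - p_ill_do(False)
print(f"naive difference: {naive:.2f}   adjusted effect: {adjusted:.2f}")
# The naive comparison suggests cities protect you; after adjusting for
# the confounder, the estimated effect is close to 0, as it should be.
```
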

Causal inference is a combination of methodology and tools that helps us in our causal analyses. Historically, it has three sources of development: statistics in healthcare and epidemiology, econometrics, and computer science. Currently, there are two popular formal frameworks for causal inference. Each framework uses different notation and basic concepts, but they are inherently similar: both will give you the same results in many problems, though in some cases one or the other will be more appropriate to use. One framework uses a type of graph called the Directed Acyclic Graph (DAG), developed and popularized mostly by Judea Pearl (a computer scientist who won the Turing Award in 2011 for his contributions to causal inference), among others. The other is based on Potential Outcomes (PO), which is closer to the way of thinking used in statistics, and was developed and popularized by, among others, Donald Rubin and James Robins (who used it in biostatistics and epidemiology), and Guido Imbens and Joshua Angrist (who applied it in econometrics and won the Nobel Prize in 2021 for their contributions to causal inference). In parts I and II of this book, we will use the language of DAGs (following the work of Judea Pearl), and in part III we will work with POs.

The learning journey

This book is for data scientists, data analysts, economists and statisticians who want to improve their decision making using observational data. You will start by learning how to identify when you have a causal problem and when you don’t. Not all problems are causal: there are situations where we just want descriptive statistics or we need forecasting models. Knowing which kind of problem you are dealing with will help you choose the right tools to work with it.

Then the book will take you through the causal inference process I described earlier. Along the way, you will learn to:

  • Distinguish when you need experiments, causal inference, or machine learning.
  • Model reality using causal graphs.
  • Communicate more efficiently through graphs, explaining your objectives, your assumptions, what risks you are taking, and what can and cannot be answered from your data.
  • Determine whether you have enough variables for your analysis and, if you don’t, propose which ones are necessary.
  • Estimate causal effects using statistical and machine learning techniques.

To work through the book, you will need a basic background in the following topics (knowing what they are and when they are used; some hands-on experience is recommended):

  • Probability
    • Basic probability formulas such as the law of total probability and conditional probabilities.
    • Basic probability distributions such as gaussian or binomial.
    • How to generate random numbers with a computer.
  • Statistics
    • Linear and logistic regression
    • Confidence intervals
    • Basic knowledge of A/B testing or Randomized Controlled Trials (how group assignment is done and hypothesis testing) is recommended
  • Programming
    • Basic coding skills (read/write basic programs) with at least one programming language. Some examples are Python, R or Julia.
  • Machine Learning
    • What cross validation is and how to compute it
    • Experience with machine learning models such as kNN, random forests, boosting or deep learning is recommended.

This book has three parts. The first one will solve the problem of distinguishing causality from correlation in the presence of confounders, which are common causes of the decision and outcome variables. This situation is explained later in this chapter, in the section called “A general diagram in causal inference,” which is a very important example that you need to have in mind throughout the book. In the second part of the book, we will learn how to apply the previous solution to concrete examples, using statistical and machine learning tools. In the third part you will find a very useful set of tools, developed mainly in the econometrics literature, that help you estimate causal effects in very specific situations. For instance, imagine that you work in a multinational company and you change your product in one country. You could use data from other countries, before and after the change, to infer the impact of your new product on your sales.

Before we start learning about causal inference, however, it’s important that you understand the difference between obtaining data through experiments, where you set a specific context and conditions in order to draw conclusions, and obtaining it by observation, where data comes as it is, in an uncontrolled environment. In the former case, drawing causal conclusions is relatively straightforward, while in the latter it becomes much more complex.

Developing intuition and formal methodology

There are two aspects that are very important for using causal inference. The first is that you feel comfortable with its ideas and techniques; the second is that you can find problems to apply it to. This book puts a special focus on both.

Developing your intuition

Causal inference is a fascinating field because it contains, at the same time, very intuitive and very un-intuitive ideas. We all experience causality in our lives and regularly think through a cause-effect lens. We agree, for instance, that when it rains, and the floor gets wet, the cause of the wet floor is the rain. It is as simple as that. However, if you try to find out how we actually know there is a causal relationship between them, you soon realize that it is not trivial at all. We only see that one thing precedes the other. As you go through this book, you will encounter concepts that might be unfamiliar or even unexpected. You may need to reconsider some concepts with which you are very familiar, such as conditional probabilities, linear models, and even machine learning models, and view them from a different perspective. I introduce these new points of view through intuitive examples and ideas. However, working at the intuitive level can come at the cost of some formalism. Don’t get me wrong, definitions, theorems and formulas have to be 100% precise, and being informal cannot be an excuse for being wrong. But, in the spirit of “all models are wrong, but some are useful” (as George E.P. Box once said), in this book, I prioritize explaining useful ideas over formal ones, usually through metaphors and simplifications.

As you probably know, it has been proven mathematically impossible to make a 2D map of the world that preserves accurate distances (that is, a map where the distance between any two points on Earth is proportional to the distance between their projections on the map). When you peel an orange, you cannot make it flat without breaking some part of it. In a flat map of the world, there will always be some cities whose distance on the map does not accurately represent their distance in the world. Nonetheless, the map is still useful. In the same way, you may find some degree of informality when this book discusses the differences between causal inference, machine learning and statistics. For instance, I will say that causal inference is for finding causes, while machine learning is for predicting. It shouldn’t be understood as an absolute statement, but more of a generalization that can be helpful to you as you identify which kind of problem you are dealing with and which tools you will need to apply. As you get to know causal inference better, you will discover that, as in any other area of human knowledge, there is overlap between subjects and approaches, and the boundaries between them are quite blurry. In case you want to dive into the formal foundations of causal inference, I strongly suggest you read Pearl’s book Causality.

This book relies heavily on examples, not just to explain causal inference but also to show how to apply it. There is a trade-off inherent in the level of detail required to describe them: the higher the detail, the more realistic the example. However, too much detail will prevent you from “seeing the forest for the trees,” thus becoming counter-productive. I generally keep the examples simple for teaching purposes. The benefit is that they are easier to adapt later on to your own problems. Their role is to be an inspirational seed that gives you something to start with. Once you have a working knowledge of the basic elements (what acts as a treatment, what acts as an outcome, which are the potential confounders, etc.), you will be well situated to add the details specific to whatever problem you have at hand.

Practicing the methodology

In addition to exercises, I also rely on repetition to help you to integrate the causal inference techniques into your toolbox. You will see how to calculate causal effects with binary variables, linear models, many different algorithms combining machine learning models, and more. At first, each chapter may seem different, but at some point, you may start to feel that we are always doing the same thing, but from different points of view. That’s good, because it means that you are getting the point!

No tool is useful if we cannot put it into practice. But here practicality has two sides: what you should do and what you shouldn’t do. And in causal inference, the latter is sometimes more important than the former. Consider the example from earlier, where we are interested in knowing what causes a particular illness. We may know that exercise helps prevent it (a known variable), we may suspect that socio-economic status influences it (a known unknown if we don’t have this information from patients), but still there is a potentially large list of causes that may affect the illness that we are unaware of (the unknown unknowns). In causal inference, the unknown unknowns are crucial. In causal inference the unknown unknowns are crucial. No, repeating the last sentence is not a typo, I’m just emphasizing its importance! So, besides the practical aspect of knowing which formulas you need to apply in each situation, there is the practical aspect of choosing which battles to fight. Being aware of what you know and what you don’t know should help you to avoid a lot of problems, not least by allowing you to choose those projects in which you have more chances of success.

That’s all for now. If you want to see more, check out the book here.