From Causal Inference in Data Science by Aleix Ruiz de Villa
Causal inference models predict why something will happen, i.e. causal effects, rather than the outcomes themselves. This is useful in many instances and is a budding field in machine learning and data science.
Read on to see how it works and what you will learn from this book.
In most of the machine learning applications you find in commercial enterprises (and outside research), your objective is to make predictions. So, you create a predictive model that, with some accuracy, will make a guess about the future. For instance, a hospital may be interested in predicting which patients are going to be severely ill, so that they can prioritize their treatment. In most predictive models, the mere prediction will do; you don’t need to know why it is the way it is.
Causal inference works the other way around. You want to understand why, and moreover you wonder what could be done to get a different outcome. A hospital, for instance, may be interested in the factors that affect some illness. Knowing these factors will help them to create public healthcare policies or drugs to prevent people from getting ill. The hospital wants to change how things currently are, in order to reduce the number of people ending up in the hospital.
It is important to learn about causal inference because it gives us tools to estimate causal effects. That is, it gives us ways to discern correlation from causation, so that we can tell which are the relevant factors that are causing an event of interest and which are not.
How causal inference works
Let’s continue with our example of finding out the causes of a particular illness. Imagine that you have a dataset with information about your patients (such as demographics, number of visits to the hospital, and the like) and what treatments they have received. What steps would you need to follow to analyze the causes of the illness? Let’s use this example to see how causal inference works, in five steps as summarized in Figure 1.
Figure 1. Five steps describing the typical process in causal inference
1. Determine the type of data
The first thing you need to know is how your data has been created. In some cases, before gathering the data, we can design and run an experiment, in order to control the environment so that we can safely attribute the impact of a cause. In these situations, we are dealing with what is called experimental data. Unfortunately, it is not always possible to run experiments. Imagine that you want to understand the effects of smoking in teenagers. If you suspect that smoking produces cancer, you cannot run an experiment with teenagers, where you decide who smokes and who does not, because it is not ethical. Another situation is when you analyze historical data in which you weren’t able to run a controlled experiment. When we don’t have experimental data, which turns out to be most of the time, we say that we have observational data. There is a huge difference between these two types of data: generally speaking, we will be much more confident about the results obtained from experimental data than those obtained from observational data.
As we will see later in this chapter, experimental data is always the preferred option. Unfortunately, running experiments is not always possible. That is when causal inference techniques come in. Simply put, if you have causal questions and you don’t have access to experimental data, then you need causal inference. In our example, there has not been any type of experiment design; we have just stored the data as we were receiving it.
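The difference between the two kinds of data can be made concrete with a small simulation. The sketch below is only an illustration under invented assumptions: the variable names, the probabilities and the "true effect" of 2 fewer days in hospital are all hypothetical and do not come from the book. The same patients are generated twice, once with treatment assigned by a coin flip (experimental data) and once with severe cases treated more often (observational data).

```python
import random
from statistics import mean

random.seed(1)

TRUE_EFFECT = -2.0  # assumed: treatment shortens the hospital stay by 2 days


def simulate_patient(randomized):
    """Return (treated, days_in_hospital) for one hypothetical patient."""
    severe = random.random() < 0.5
    if randomized:
        # Experimental data: a coin flip decides who is treated.
        treated = random.random() < 0.5
    else:
        # Observational data: severe cases are treated more often.
        treated = random.random() < (0.8 if severe else 0.2)
    days = (
        5.0
        + (4.0 if severe else 0.0)          # severity lengthens the stay
        + (TRUE_EFFECT if treated else 0.0)  # treatment shortens it
        + random.gauss(0, 1)                 # individual variation
    )
    return treated, days


def difference_in_means(patients):
    """Average stay of treated patients minus average stay of untreated ones."""
    treated_days = [days for treated, days in patients if treated]
    untreated_days = [days for treated, days in patients if not treated]
    return mean(treated_days) - mean(untreated_days)


experimental = difference_in_means([simulate_patient(True) for _ in range(50_000)])
observational = difference_in_means([simulate_patient(False) for _ in range(50_000)])

print(f"experimental estimate:  {experimental:+.2f}")
print(f"observational estimate: {observational:+.2f}")
```

With randomized assignment, the simple difference in means lands close to the assumed effect of -2 days. With self-selected treatment, the same calculation comes out positive, because treated patients are disproportionately the severe ones.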
2. Understand your problem
Moving to the second step, if you want to understand what makes people ill, you need to gather all the potential causes of the illness. Besides basic demographics such as age, gender, and location, you also need patients’ medical history. “The more information you have, the better” may sound to you like a Big Data mantra, but this is different. While in machine learning you can create accurate predictive models without having all the variables, in causal inference, missing a relevant variable can be crucial. For example, for some illnesses, having comorbidities (other illnesses different from the one we are interested in) may be very relevant. Imagine that, for whatever reason, you don’t have access to patients’ comorbidities. Then you will not be able to determine which comorbidities affect your illness. In contrast, you may still be able to create a successful predictive machine learning algorithm that tells you whether someone is going to be ill or not, because comorbidities make patients visit the hospital more often. So, even though you don’t have patients’ comorbidities, you may have highly correlated information, namely the frequency of patients’ visits, which may be enough to predict patients’ likelihood of getting ill.
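The comorbidity example can be sketched in a few lines. Everything here is a hypothetical setup chosen for illustration (the 30% comorbidity rate, the visit ranges and the illness probabilities are invented): comorbidity drives both the number of visits and the illness, but it is never recorded, so the predictive rule only sees the visits.

```python
import random

random.seed(2)


def simulate_patient():
    """Return (visits, ill); comorbidity is generated but never recorded."""
    comorbid = random.random() < 0.3
    # Comorbidities drive both the number of visits and the illness.
    visits = random.randint(3, 8) if comorbid else random.randint(0, 3)
    ill = random.random() < (0.8 if comorbid else 0.1)
    return visits, ill


patients = [simulate_patient() for _ in range(50_000)]

# Predictive rule that uses only the proxy: many visits -> predict illness.
accuracy = sum((visits >= 4) == ill for visits, ill in patients) / len(patients)

# Baseline that ignores all information and always predicts "not ill".
baseline = sum(not ill for _, ill in patients) / len(patients)

print(f"accuracy using visits only: {accuracy:.2f}")
print(f"always-predict-healthy baseline: {baseline:.2f}")
```

The rule based on visits beats the baseline even though, in this simulation, visits have no effect on the illness at all. The proxy is good enough for prediction, but it says nothing about which comorbidities cause the illness.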
3. Create a model
Third, now that you have all the relevant variables, you create a causal model. This model should describe which variables cause which others, or equivalently, describe how the data was generated. This model, as in physics, engineering or any other scientific discipline, is sometimes good and sometimes not so good. If the model is too simple, it will explain reality poorly, while complex models are more prone to human error. So, there is a trade-off between these two problems when determining the level of complexity of your model. The way to decide whether a model is an accurate enough approximation of reality is simply whether it ends up being useful for our purpose or not, and this, in turn, depends heavily on what you want to achieve. For instance, imagine that you develop a machine learning model that is used in a mobile app to detect objects in any picture that you take with your camera. Since you are using it for your personal use, if the model fails from time to time, it is probably not that bad and you can live with that. However, if you are using a machine learning model in a self-driving car to detect objects on the road, a mistake can turn into an accident. In this case, a model with an accuracy less than 99.99999% is not reliable, hence not useful.
In causal inference, one way to create this model is using graphs that describe variables as nodes and causal effects between variables as directed edges (called arrows). Another way to create models is only using equations.
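As a rough illustration of the graph representation, a causal graph can be stored as a plain dictionary from each cause to its direct effects. The variables and arrows below are hypothetical, invented for the hospital example; they are not a model taken from the book.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical causal graph: each key is a cause, each value lists its direct effects.
causal_graph = {
    "age": ["comorbidities", "illness"],
    "comorbidities": ["hospital_visits", "illness"],
    "treatment": ["illness"],
    "hospital_visits": [],
    "illness": [],
}


def parents(graph, node):
    """Return the direct causes of `node` (the tails of its incoming arrows)."""
    return sorted(cause for cause, effects in graph.items() if node in effects)


def ancestors(graph, node):
    """Return every variable with a directed path into `node`."""
    found = set()
    stack = parents(graph, node)
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents(graph, current))
    return found


def is_acyclic(graph):
    """Check that the graph has no directed cycles."""
    # TopologicalSorter expects node -> predecessors, so the edges are inverted.
    predecessors = {node: parents(graph, node) for node in graph}
    try:
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


print(parents(causal_graph, "illness"))
print(is_acyclic(causal_graph))
# Variables that are causes of both hospital visits and illness:
print(ancestors(causal_graph, "hospital_visits") & ancestors(causal_graph, "illness"))
```

Writing the graph down this way forces every assumed arrow to be explicit, which also makes the model easier to review with other people.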
4. Share your model
You arrive at step four with the model you have created, based on some assumptions and with a goal in mind (to estimate some causal effect). Now make these assumptions and goals explicit and seek consensus with experts and other people involved in the analysis. You may have missed relevant variables or misinterpreted how variables were related. It is a good practice to communicate with others to make sure you are articulating the appropriate questions, agreeing on the overall goals and identifying the possible variables and how these variables relate to each other.
5. Apply causal inference techniques
Finally, at the fifth step, it is time to apply causal inference techniques to your dataset (notice that it has not been used so far) and your model to answer your causal questions. As we’ve said, correlation is not causation, so the fact that two variables are correlated doesn’t mean that one causes the other. Usually, when two variables are correlated but one is not the only cause of the other, it is due to the existence of a third factor that causes both. Variables playing this role of common cause are called confounders. Informally speaking, in causal inference, the presence of confounders is the root of all evil. Fortunately, causal inference has a set of formulas, algorithms and methodologies that let you deal with them, through the following steps:
- Ask yourself what you can answer with the information you have. State which of your causal questions can be answered using your model and your data: sometimes, the lack of information about some confounders becomes a problem. Identify the cases in which it can be overcome, and for the rest analyze alternatives, such as gathering new data or finding surrogate variables.
- Check whether the assumptions you made in creating your model are substantiated by your data. Fortunately, as we will see, some of these assumptions can be checked with your data.
- Discern correlation from causation using your own data in order to estimate causal effects. This is done using a specific set of formulas. Most of this book is devoted to explaining how, when, and why to employ these formulas, how to select the appropriate formulas for different kinds of problems, and how to apply them efficiently using statistical and machine learning techniques.
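To give a flavor of what dealing with a confounder looks like, here is a minimal sketch that uses stratification on a single observed confounder. This excerpt does not name any specific formula, so the choice of technique is an assumption made here, and all names and probabilities are hypothetical. In the simulated data, severity makes patients both more likely to be treated and more likely to stay ill, while the treatment itself lowers the probability of illness by 0.1.

```python
import random

random.seed(0)


def simulate_patient():
    """Return (severe, treated, ill) for one hypothetical patient."""
    severe = random.random() < 0.5                        # the confounder
    treated = random.random() < (0.8 if severe else 0.2)  # severity -> treatment
    p_ill = 0.2 + (0.5 if severe else 0.0) - (0.1 if treated else 0.0)
    ill = random.random() < p_ill                         # severity, treatment -> illness
    return severe, treated, ill


patients = [simulate_patient() for _ in range(100_000)]


def illness_rate(rows):
    """Fraction of patients in `rows` who are ill."""
    rows = list(rows)
    return sum(ill for _, _, ill in rows) / len(rows)


# Naive comparison: treated versus untreated, ignoring severity.
naive = illness_rate(p for p in patients if p[1]) - illness_rate(
    p for p in patients if not p[1]
)

# Adjusted comparison: compare within each severity level, then average the
# per-level differences weighted by how common each level is.
adjusted = 0.0
for severity in (False, True):
    stratum = [p for p in patients if p[0] == severity]
    weight = len(stratum) / len(patients)
    effect = illness_rate(p for p in stratum if p[1]) - illness_rate(
        p for p in stratum if not p[1]
    )
    adjusted += weight * effect

print(f"naive difference:    {naive:+.3f}")
print(f"adjusted difference: {adjusted:+.3f}")
```

The naive difference is positive, which would wrongly suggest that the treatment makes patients worse, while the adjusted difference is close to the assumed effect of -0.1. This only works because the confounder is observed; if severity were missing from the data, the adjustment could not be computed.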
Causal inference is a combination of methodology and tools that helps us in our causal analysis. Historically, it has three sources of development: statistics in healthcare and epidemiology, econometrics, and computer science. Currently there are two popular formal frameworks for working with causal inference. Each framework uses different notation and basic concepts, but they are inherently similar. Both will give you the same results in many problems, but in some cases one or the other will be more appropriate to use. One framework uses a type of graph called a Directed Acyclic Graph (DAG), developed and popularized mostly by Judea Pearl (a computer scientist who won the Turing Award in 2011 for his contribution to causal inference), among others. The other is based on Potential Outcomes (PO), which is closer to the way of thinking used in statistics, and was developed and popularized by, among others, Donald Rubin, James Robins (who used it in biostatistics and epidemiology), and Guido Imbens and Joshua Angrist (who applied it in econometrics and won a Nobel Prize in 2021 for their contribution to causal inference). In parts I and II of this book, we will use the language of DAGs (following the work of Judea Pearl) and in part III we will work with POs.
The learning journey
This book is for data scientists, data analysts, economists and statisticians who want to improve their decision making using observational data. You will start by learning how to identify when you have a causal problem and when you don’t. Not all problems are causal: we can find situations where we just want descriptive statistics or we need forecasting models. Knowing which kind of problem you are dealing with will help you to choose the right tools to work with it.
Then the book will take you through the causal inference process I described earlier. Along the way, you will learn:
- To distinguish when you need experiments, causal inference, or machine learning.
- To model reality using causal graphs.
- To communicate more efficiently through graphs to explain your objectives, your assumptions, what risks you are taking, what can be answered from your data and what can’t.
- To determine if you have enough variables for your analysis and, in case you don’t, to propose which ones are necessary.
- To estimate causal effects using statistical and machine learning techniques.
To follow the book, you will need a basic background in the following topics (what they are and when they are used; some hands-on experience is recommended):
- Basic probability formulas such as the law of total probability and conditional probabilities.
- Basic probability distributions such as Gaussian or binomial.
- How to generate random numbers with a computer.
- Linear and logistic regression.
- Confidence intervals.
- Basic knowledge of A/B testing or Randomized Controlled Trials (how group assignment is done and hypothesis testing) is recommended.
- Basic coding skills (read/write basic programs) with at least one programming language. Some examples are Python, R or Julia.
- Machine learning:
  - What cross-validation is and how to compute it.
  - Experience with machine learning models such as kNN, random forests, boosting or deep learning is recommended.
This book has three parts. The first one will solve the problem of distinguishing causality from correlation in the presence of confounders, which are common causes of the decision and outcome variables. This situation is explained later in this chapter, in the section called “A general diagram in causal inference,” which is a very important example that you need to have in mind throughout the book. In the second part of the book, we will learn how to apply the previous solution to concrete examples, using statistical and machine learning tools. In the third part you will find a very useful set of tools, developed mainly in the econometrics literature, that help you estimate causal effects in very specific situations. For instance, imagine that you work in a multinational company and you change your product in one country. You could use data from other countries, before and after the change, to infer the impact of your new product on your sales.
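One technique from the econometrics literature that matches the multinational example is difference-in-differences. The excerpt does not name the method, so treating the example this way is an assumption, and the sales figures below are invented purely for illustration. The idea is to subtract the change seen in the other countries from the change seen in the country where the product changed.

```python
# Hypothetical average weekly sales, invented for illustration only.
sales = {
    "changed_country": {"before": 100.0, "after": 130.0},
    "other_countries": {"before": 80.0, "after": 90.0},
}


def difference_in_differences(treated, control):
    """Change in the treated group minus change in the control group."""
    treated_change = treated["after"] - treated["before"]
    control_change = control["after"] - control["before"]
    return treated_change - control_change


estimated_impact = difference_in_differences(
    sales["changed_country"], sales["other_countries"]
)
print(estimated_impact)  # 20.0
```

The estimate is only meaningful under the assumption that, without the product change, sales in the changed country would have moved in parallel with sales in the other countries.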
Before we start learning about causal inference, however, it’s important that you understand the difference between obtaining data through experiments, where you set a specific context and conditions in order to draw conclusions, and obtaining it by observation, where data is collected as it comes, in an uncontrolled environment. In the former case, drawing causal conclusions is relatively straightforward, while in the latter case, it becomes much more complex.
Developing intuition and formal methodology
Two things are very important if you are to end up using causal inference. The first is that you feel comfortable with its ideas and techniques, and the second is that you find problems to apply it to. This book puts a special focus on both.
Developing your intuition
Causal inference is a fascinating field because it contains, at the same time, very intuitive and very unintuitive ideas. We all experience causality in our lives and think regularly through a cause-effect lens. We agree, for instance, that when it rains and the floor gets wet, the cause of the wet floor is the rain. It is as simple as that. However, if you try to find out how we actually know there is a causal relationship between them, you soon realize that it is not trivial at all. We only see that one thing precedes the other. As you go through this book, you will encounter concepts that might be unfamiliar or even unexpected. You may need to reconsider some concepts with which you are very familiar, such as conditional probabilities, linear models, and even machine learning models, and view them from a different perspective. I introduce these new points of view through intuitive examples and ideas. However, working at the intuitive level can come at the cost of some formalism. Don’t get me wrong: definitions, theorems and formulas have to be 100% precise, and being informal cannot be an excuse for being wrong. But, in the spirit of “all models are wrong, but some are useful” (as George E.P. Box once said), in this book I prioritize explaining useful ideas over formal ones, usually through metaphors and simplifications.
As you probably know, and as has been proven mathematically, it is impossible to make a 2D map of the world that preserves distances accurately (that is, a map on which the distance between any two points is proportional to the distance between the corresponding points on earth). When you peel an orange, you cannot make it flat without breaking some part of it. In a flat map of the world, there will always be some cities whose distance on the map does not accurately represent their distance in the world. Nonetheless, the map is still useful. In the same way, you may find some degree of informality when this book discusses the differences between causal inference, machine learning and statistics. For instance, I will say that causal inference is for finding causes, while machine learning is for predicting. This shouldn’t be understood as an absolute statement, but rather as a generalization that can help you identify which kind of problem you are dealing with and which tools you will need to apply. As you get to know causal inference better, you will discover that, as in any other area of human knowledge, there is overlap between subjects and approaches, and the boundaries between them are quite blurry. In case you want to dive into the formal foundations of causal inference, I strongly suggest that you read Pearl’s “Causality” book.
This book relies heavily on examples, not just to explain causal inference but also to show how to apply it. There is a trade-off inherent in the level of detail required to describe them. The higher the level of detail, the more realistic the example. However, too much detail will prevent you from “seeing the forest for the trees,” and thus becomes counter-productive. I generally try to keep the examples simple for teaching purposes. The resulting benefit is that they are easier to adapt later on to your own problems. Their role should be that of an inspirational seed that gives you something to start with. Once you have a working knowledge of the basic elements (what acts as a treatment, what acts as an outcome, which are the potential confounders, etc.), you will be well placed to add the details specific to whatever problem you have at hand.
Practicing the methodology
In addition to exercises, I also rely on repetition to help you to integrate the causal inference techniques into your toolbox. You will see how to calculate causal effects with binary variables, linear models, many different algorithms combining machine learning models, and more. At first, each chapter may seem different, but at some point, you may start to feel that we are always doing the same thing, but from different points of view. That’s good, because it means that you are getting the point!
No tool is useful if we cannot put it into practice. But here practicality has two sides: what you should do and what you shouldn’t do. And in causal inference, the latter is sometimes more important than the former. Consider the example from the beginning of the book, where we are interested in knowing what causes a particular illness. We may know that exercise prevents it (a known variable), we may suspect that socio-economic status influences it (a known unknown if we don’t have this information from patients), but there is still a potentially large list of causes that may affect the illness and that we are not aware of (the unknown unknowns). In causal inference the unknown unknowns are crucial. In causal inference the unknown unknowns are crucial. No, repeating the last sentence is not a typo; I’m just emphasizing its importance! So, besides the practical aspect of knowing which formulas you need to apply in each situation, there is the practical aspect of choosing which battles to fight and which not. Being aware of what you know and what you don’t know should help you to avoid a lot of problems, not least by allowing you to choose those projects in which you have more chances of success.