From Practical Data Science with R, 2nd Ed. By Nina Zumel and John Mount
In this article, we demonstrate some ways to get to know your data, and discuss some of the potential issues that you’re looking for as you explore.
Figure 1 Mental Model of Practical Data Science with R
As shown in the mental model (figure 1), this article emphasizes the science of exploring the data, prior to the model-building step. Your goal is to have data which is as clean and useful as possible.
Suppose your goal is to build a model to predict which of your customers doesn’t have health insurance. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and details like this.
You’ve put all your data into a single data frame called customer_data which you’ve input into R. Now you’re ready to start building the model to identify the customers you’re interested in.
It’s tempting to dive right into the modeling step without looking closely at the dataset first, like when you have a lot of data. Resist the temptation. No dataset is perfect: you’ll be missing information about some of your customers, and you’ll have incorrect data about others. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed before modeling. In the worst case, you’ll build a model that returns incorrect predictions—and you won’t be sure why.
Get to know your data before modeling
By addressing data issues early, you can save yourself some unnecessary work, and a lot of headaches!
You’d also like to get a sense of who your customers are: Are they young, middle-aged, or seniors? How affluent are they? Where do they live? Knowing the answers to these questions can help you build a better model, because you’ll have a more specific idea of what information most accurately predicts the probability of insurance coverage.
Data exploration uses a combination of summary statistics—means and medians, variances, and counts—and visualization, or graphs of the data. You can spot some problems by using summary statistics; other problems are easier to find visually.
Organizing data for analysis
For this article, we’ll assume that the data you’re analyzing is in a single data frame. This isn’t how that data is usually stored. In a database, for example, data is usually stored in normalized form to reduce redundancy: information about a single customer is spread across many small tables. In log data, data about a single customer can be spread across many log entries, or sessions. These formats make it easy to add (or in the case of a database, modify) data, but aren’t optimal for analysis. You can often join all the data you need into a single table in the database using SQL, but commands like
join can be used within R to further consolidate data.
Using summary statistics to spot problems
In R, you’ll typically use the
summary command to take your first look at the data. The goal is to understand whether you have the kind of customer information that can potentially help you predict health insurance coverage, and whether the data is of good enough quality to be informative.
Listing 1. The
setwd("PDSwR2/Custdata") ❶ customer_data = readRDS("custdata.RDS") summary(customer_data) ## custid sex is_employed income ❷ ## Length:73262 Female:37837 FALSE: 2351 Min. : -6900 ## Class :character Male :35425 TRUE :45137 1st Qu.: 10700 ## Mode :character NA's :25774 Median : 26200 ## Mean : 41764 ## 3rd Qu.: 51700 ## Max. :1257000 ## ## marital_status health_ins ❸ ## Divorced/Separated:10693 Mode :logical ## Married :38400 FALSE:7307 ## Never married :19407 TRUE :65955 ## Widowed : 4762 ## ## ## ## housing_type recent_move num_vehicles ❹ ## Homeowner free and clear :16763 Mode :logical Min. :0.000 ## Homeowner with mortgage/loan:31387 FALSE:62418 1st Qu.:1.000 ## Occupied with no rent : 1138 TRUE :9123 Median :2.000 ## Rented :22254 NA's :1721 Mean :2.066 ## NA's : 1720 3rd Qu.:3.000 ## Max. :6.000 ## NA's :1720 ## age state_of_res gas_usage ❺ ## Min. : 0.00 California : 8962 Min. : 1.00 ## 1st Qu.: 34.00 Texas : 6026 1st Qu.: 3.00 ## Median : 48.00 Florida : 4979 Median : 10.00 ## Mean : 49.16 New York : 4431 Mean : 41.17 ## 3rd Qu.: 62.00 Pennsylvania: 2997 3rd Qu.: 60.00 ## Max. :120.00 Illinois : 2925 Max. :570.00 ## (Other) :42942 NA's :1720
❶ Change this to your actual path to the directory where you unpacked PDSwR2
❷ The variable is_employed is missing for about a third of the data. The variable income has negative values, which are potentially invalid.
❸ About 90% of the customers have health insurance.
❹ The variables housing_type, recent_move, num_vehicles, and gas_usage are each missing 1720 or 1721 values.
❺ The average value of the variable age seems plausible, but the minimum and maximum values seem unlikely. The variable state_of_res is a categorical variable; summary() reports how many customers are in each state (for the first few states).
summary command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns (if the categorical columns have already been read in as factors ).
As you see from listing 1, the summary of the data helps you quickly spot potential problems, like missing data or unlikely values. You also get a rough idea of how categorical data is distributed. Let’s go into more detail about the typical problems that you can spot using the summary.
Typical problems revealed by data summaries
At this stage, you’re looking for several common issues:
- Missing values
- Invalid values and outliers
- Data ranges that are too wide or too narrow
- The units of data
Let’s address each of these issues in detail.
A few missing values may not be a problem, but if a particular data field is largely unpopulated, it shouldn’t be used as an input without some repair. In R, for example, many modeling algorithms, by default, quietly drop rows with missing values. As you see in listing 2, all the missing values in the
is_employed variable could cause R to quietly ignore over a third of the data.
Listing 2. Will the variable
is.employed be useful for modeling?
## is_employed ❶ ## FALSE: 2321 ## TRUE :44887 ## NA's :24333 ## housing_type recent_move ❷ ## Homeowner free and clear :16763 Mode :logical ## Homeowner with mortgage/loan:31387 FALSE:62418 ## Occupied with no rent : 1138 TRUE :9123 ## Rented :22254 NA's :1721 ## NA's : 1720 ## ## ## num_vehicles gas_usage ## Min. :0.000 Min. : 1.00 ## 1st Qu.:1.000 1st Qu.: 3.00 ## Median :2.000 Median : 10.00 ## Mean :2.066 Mean : 41.17 ## 3rd Qu.:3.000 3rd Qu.: 60.00 ## Max. :6.000 Max. :570.00 ## NA's :1720 NA's :1720
❶ The variable is_employed is missing for over a third of the data. Why? Is employment status unknown? Did the company start collecting employment data only recently? Does NA mean “not in the active workforce” (for example, students or stay-at-home parents)?
❷ The variables housing_type, recent_move, num_vehicles and gas_usage are missing relatively few values—about 2% of the data. It’s probably safe to just drop the rows that are missing values, especially if the missing values are all in the same 1720 rows.
If a particular data field is largely unpopulated, it’s worth trying to determine why; sometimes the fact that a value is missing is informative in and of itself. For example, why is the
is_employed variable missing many values? As we noted in listing 2, there are many possible reasons.
Whatever the reason for missing data, you must decide on the most appropriate action. Do you include a variable with missing values in your model? If you decide to include it, do you drop all the rows where this field is missing, or do you convert the missing values to
0 or to an additional category? In this example, you might decide to drop the data rows where that exclude data about housing or vehicles, because there aren’t many of them. You probably don’t want to throw out the data where you’re missing employment information, because employment status is probably highly predictive of having health insurance; you might instead treat the NAs as a third employment category. You’ll likely encounter missing values when model scoring, and you should deal with them during model training.
 We have a copy of this synthetic dataset available for download from https://github.com/WinVector/PDSwR2/tree/master/Custdata, and once saved, you can load it into R with the command
customer_data <- readRDS("custdata.RDS"). This data set is derived from census data. We introduced a little noise to the
age variable to reflect what’s typically seen in real-world noisy data sets. We also included columns which are not necessarily relevant to our example scenario, but which exhibit some important data anomalies.
 Categorical variables are of class
factor in R. They can be represented as strings (class
character), and some analytical functions automatically convert string variables to factor variables. To get a useful summary of a categorical variable, it needs to be a factor.