|
From Practical Data Science with R, 2nd Ed. By Nina Zumel and John Mount In this article, we demonstrate some ways to get to know your data, and discuss some of the potential issues that you’re looking for as you explore. |
Take 37% off Practical Data Science with R, Second Edition. Just enter fcczumel3 into the discount code box at checkout at manning.com.
Figure 1 Mental Model of Practical Data Science with R
As shown in the mental model (figure 1), this article emphasizes the science of exploring the data, prior to the model-building step. Your goal is to have data which is as clean and useful as possible.
It’s tempting to dive right into the modeling step without looking closely at the dataset first, like when you have a lot of data. Resist the temptation. No dataset is perfect: you’ll be missing information about some of your customers, and you’ll have incorrect data about others. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed before modeling. In the worst case, you’ll build a model that returns incorrect predictions—and you won’t be sure why.
TIP |
Get to know your data before modeling By addressing data issues early, you can save yourself some unnecessary work, and a lot of headaches! |
You’d also like to get a sense of who your customers are: Are they young, middle-aged, or seniors? How affluent are they? Where do they live? Knowing the answers to these questions can help you build a better model, because you’ll have a more specific idea of what information most accurately predicts the probability of insurance coverage.
Data exploration uses a combination of summary statistics—means and medians, variances, and counts—and visualization, or graphs of the data. You can spot some problems by using summary statistics; other problems are easier to find visually.
\
Using summary statistics to spot problems
In R, you’ll typically use the summary
command to take your first look at the data. The goal is to understand whether you have the kind of customer information that can potentially help you predict health insurance coverage, and whether the data is of good enough quality to be informative.
Listing 1. The summary()
command
setwd("PDSwR2/Custdata") ❶ customer_data = readRDS("custdata.RDS") summary(customer_data) ## custid sex is_employed income ❷ ## Length:73262 Female:37837 FALSE: 2351 Min. : -6900 ## Class :character Male :35425 TRUE :45137 1st Qu.: 10700 ## Mode :character NA's :25774 Median : 26200 ## Mean : 41764 ## 3rd Qu.: 51700 ## Max. :1257000 ## ## marital_status health_ins ❸ ## Divorced/Separated:10693 Mode :logical ## Married :38400 FALSE:7307 ## Never married :19407 TRUE :65955 ## Widowed : 4762 ## ## ## ## housing_type recent_move num_vehicles ❹ ## Homeowner free and clear :16763 Mode :logical Min. :0.000 ## Homeowner with mortgage/loan:31387 FALSE:62418 1st Qu.:1.000 ## Occupied with no rent : 1138 TRUE :9123 Median :2.000 ## Rented :22254 NA's :1721 Mean :2.066 ## NA's : 1720 3rd Qu.:3.000 ## Max. :6.000 ## NA's :1720 ## age state_of_res gas_usage ❺ ## Min. : 0.00 California : 8962 Min. : 1.00 ## 1st Qu.: 34.00 Texas : 6026 1st Qu.: 3.00 ## Median : 48.00 Florida : 4979 Median : 10.00 ## Mean : 49.16 New York : 4431 Mean : 41.17 ## 3rd Qu.: 62.00 Pennsylvania: 2997 3rd Qu.: 60.00 ## Max. :120.00 Illinois : 2925 Max. :570.00 ## (Other) :42942 NA's :1720
❶ Change this to your actual path to the directory where you unpacked PDSwR2
❷ The variable is_employed is missing for about a third of the data. The variable income has negative values, which are potentially invalid.
❸ About 90% of the customers have health insurance.
❹ The variables housing_type, recent_move, num_vehicles, and gas_usage are each missing 1720 or 1721 values.
❺ The average value of the variable age seems plausible, but the minimum and maximum values seem unlikely. The variable state_of_res is a categorical variable; summary() reports how many customers are in each state (for the first few states).
The summary
command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns (if the categorical columns have already been read in as factors[2] ).
As you see from listing 1, the summary of the data helps you quickly spot potential problems, like missing data or unlikely values. You also get a rough idea of how categorical data is distributed. Let’s go into more detail about the typical problems that you can spot using the summary.
Typical problems revealed by data summaries
At this stage, you’re looking for several common issues:
- Missing values
- Invalid values and outliers
- Data ranges that are too wide or too narrow
- The units of data
Let’s address each of these issues in detail.
MISSING VALUES
A few missing values may not be a problem, but if a particular data field is largely unpopulated, it shouldn’t be used as an input without some repair. In R, for example, many modeling algorithms, by default, quietly drop rows with missing values. As you see in listing 2, all the missing values in the is_employed
variable could cause R to quietly ignore over a third of the data.
Listing 2. Will the variable is.employed
be useful for modeling?
## is_employed ❶ ## FALSE: 2321 ## TRUE :44887 ## NA's :24333 ## housing_type recent_move ❷ ## Homeowner free and clear :16763 Mode :logical ## Homeowner with mortgage/loan:31387 FALSE:62418 ## Occupied with no rent : 1138 TRUE :9123 ## Rented :22254 NA's :1721 ## NA's : 1720 ## ## ## num_vehicles gas_usage ## Min. :0.000 Min. : 1.00 ## 1st Qu.:1.000 1st Qu.: 3.00 ## Median :2.000 Median : 10.00 ## Mean :2.066 Mean : 41.17 ## 3rd Qu.:3.000 3rd Qu.: 60.00 ## Max. :6.000 Max. :570.00 ## NA's :1720 NA's :1720
❶ The variable is_employed is missing for over a third of the data. Why? Is employment status unknown? Did the company start collecting employment data only recently? Does NA mean “not in the active workforce” (for example, students or stay-at-home parents)?
❷ The variables housing_type, recent_move, num_vehicles and gas_usage are missing relatively few values—about 2% of the data. It’s probably safe to just drop the rows that are missing values, especially if the missing values are all in the same 1720 rows.
If a particular data field is largely unpopulated, it’s worth trying to determine why; sometimes the fact that a value is missing is informative in and of itself. For example, why is the is_employed
variable missing many values? As we noted in listing 2, there are many possible reasons.
Whatever the reason for missing data, you must decide on the most appropriate action. Do you include a variable with missing values in your model? If you decide to include it, do you drop all the rows where this field is missing, or do you convert the missing values to 0
or to an additional category? In this example, you might decide to drop the data rows where that exclude data about housing or vehicles, because there aren’t many of them. You probably don’t want to throw out the data where you’re missing employment information, because employment status is probably highly predictive of having health insurance; you might instead treat the NAs as a third employment category. You’ll likely encounter missing values when model scoring, and you should deal with them during model training.
That’s all for now. If you want to know more about the book, check it out on liveBook here and see this slide deck.
[1] We have a copy of this synthetic dataset available for download from https://github.com/WinVector/PDSwR2/tree/master/Custdata, and once saved, you can load it into R with the command customer_data <- readRDS("custdata.RDS")
. This data set is derived from census data. We introduced a little noise to the age
variable to reflect what’s typically seen in real-world noisy data sets. We also included columns which are not necessarily relevant to our example scenario, but which exhibit some important data anomalies.
[2] Categorical variables are of class factor
in R. They can be represented as strings (class character
), and some analytical functions automatically convert string variables to factor variables. To get a useful summary of a categorical variable, it needs to be a factor.