cover

From Practical Data Science with R, 2nd Ed. By Nina Zumel and John Mount

In this article, we demonstrate some ways to get to know your data, and discuss some of the potential issues that you’re looking for as you explore.

Take 37% off Practical Data Science with R, Second Edition. Just enter fcczumel3 into the discount code box at checkout at manning.com.


Figure 1 Mental Model of Practical Data Science with R


As shown in the mental model (figure 1), this article emphasizes the science of exploring the data, prior to the model-building step. Your goal is to have data which is as clean and useful as possible.

Example Scenario

Suppose your goal is to build a model to predict which of your customers doesn’t have health insurance. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and details like this.

You’ve put all your data into a single data frame called customer_data which you’ve input into R.[1]  Now you’re ready to start building the model to identify the customers you’re interested in.

It’s tempting to dive right into the modeling step without looking closely at the dataset first, like when you have a lot of data. Resist the temptation. No dataset is perfect: you’ll be missing information about some of your customers, and you’ll have incorrect data about others. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed before modeling. In the worst case, you’ll build a model that returns incorrect predictions—and you won’t be sure why.

TIP

Get to know your data before modeling

By addressing data issues early, you can save yourself some unnecessary work, and a lot of headaches!

You’d also like to get a sense of who your customers are: Are they young, middle-aged, or seniors? How affluent are they? Where do they live? Knowing the answers to these questions can help you build a better model, because you’ll have a more specific idea of what information most accurately predicts the probability of insurance coverage.

Data exploration uses a combination of summary statistics—means and medians, variances, and counts—and visualization, or graphs of the data. You can spot some problems by using summary statistics; other problems are easier to find visually.

\

Organizing data for analysis

For this article, we’ll assume that the data you’re analyzing is in a single data frame. This isn’t how that data is usually stored. In a database, for example, data is usually stored in normalized form to reduce redundancy: information about a single customer is spread across many small tables. In log data, data about a single customer can be spread across many log entries, or sessions. These formats make it easy to add (or in the case of a database, modify) data, but aren’t optimal for analysis. You can often join all the data you need into a single table in the database using SQL, but commands like join can be used within R to further consolidate data.

Using summary statistics to spot problems

In R, you’ll typically use the summary command to take your first look at the data. The goal is to understand whether you have the kind of customer information that can potentially help you predict health insurance coverage, and whether the data is of good enough quality to be informative.

Listing 1. The summary() command

  
 setwd("PDSwR2/Custdata")                                               
 customer_data = readRDS("custdata.RDS")
 summary(customer_data)
 ##     custid              sex        is_employed       income         
 ##  Length:73262       Female:37837   FALSE: 2351   Min.   :  -6900 
 ##  Class :character   Male  :35425   TRUE :45137   1st Qu.:  10700 
 ##  Mode  :character                  NA's :25774   Median :  26200 
 ##                                                  Mean   :  41764 
 ##                                                  3rd Qu.:  51700 
 ##                                                  Max.   :1257000 
 ##                                                                  
 ##             marital_status  health_ins                              
 ##  Divorced/Separated:10693   Mode :logical 
 ##  Married           :38400   FALSE:7307    
 ##  Never married     :19407   TRUE :65955   
 ##  Widowed           : 4762                 
 ##                                            
 ##                                           
 ##                                           
 ##                        housing_type   recent_move      num_vehicles 
 ##  Homeowner free and clear    :16763   Mode :logical   Min.   :0.000 
 ##  Homeowner with mortgage/loan:31387   FALSE:62418     1st Qu.:1.000 
 ##  Occupied with no rent       : 1138   TRUE :9123      Median :2.000 
 ##  Rented                      :22254   NA's :1721      Mean   :2.066 
 ##  NA's                        : 1720                   3rd Qu.:3.000 
 ##                                                       Max.   :6.000 
 ##                                                       NA's   :1720  
 ##       age               state_of_res     gas_usage                  
 ##  Min.   :  0.00   California  : 8962   Min.   :  1.00 
 ##  1st Qu.: 34.00   Texas       : 6026   1st Qu.:  3.00 
 ##  Median : 48.00   Florida     : 4979   Median : 10.00 
 ##  Mean   : 49.16   New York    : 4431   Mean   : 41.17 
 ##  3rd Qu.: 62.00   Pennsylvania: 2997   3rd Qu.: 60.00 
 ##  Max.   :120.00   Illinois    : 2925   Max.   :570.00 
 ##                   (Other)     :42942   NA's   :1720
  

  Change this to your actual path to the directory where you unpacked PDSwR2

  The variable is_employed is missing for about a third of the data. The variable income has negative values, which are potentially invalid.

  About 90% of the customers have health insurance.

  The variables housing_type, recent_move, num_vehicles, and gas_usage are each missing 1720 or 1721 values.

  The average value of the variable age seems plausible, but the minimum and maximum values seem unlikely. The variable state_of_res is a categorical variable; summary() reports how many customers are in each state (for the first few states).

The summary command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns (if the categorical columns have already been read in as factors[2] ).

As you see from listing 1, the summary of the data helps you quickly spot potential problems, like missing data or unlikely values. You also get a rough idea of how categorical data is distributed. Let’s go into more detail about the typical problems that you can spot using the summary.

Typical problems revealed by data summaries

At this stage, you’re looking for several common issues:

  • Missing values
  • Invalid values and outliers
  • Data ranges that are too wide or too narrow
  • The units of data

Let’s address each of these issues in detail.

MISSING VALUES

A few missing values may not be a problem, but if a particular data field is largely unpopulated, it shouldn’t be used as an input without some repair. In R, for example, many modeling algorithms, by default, quietly drop rows with missing values. As you see in listing 2, all the missing values in the is_employed variable could cause R to quietly ignore over a third of the data.

Listing 2. Will the variable is.employed be useful for modeling?

  
 ## is_employed                                       
 ## FALSE: 2321  
 ## TRUE :44887 
 ## NA's :24333 
               
 ##                       housing_type   recent_move  
 ## Homeowner free and clear    :16763   Mode :logical 
 ## Homeowner with mortgage/loan:31387   FALSE:62418   
 ## Occupied with no rent       : 1138   TRUE :9123    
 ## Rented                      :22254   NA's :1721    
 ## NA's                        : 1720                 
 ##                                                                   
 ##                                                                    
 ##   num_vehicles     gas_usage    
 ##  Min.   :0.000   Min.   :  1.00 
 ##  1st Qu.:1.000   1st Qu.:  3.00 
 ##  Median :2.000   Median : 10.00 
 ##  Mean   :2.066   Mean   : 41.17 
 ##  3rd Qu.:3.000   3rd Qu.: 60.00 
 ##  Max.   :6.000   Max.   :570.00 
 ##  NA's   :1720    NA's   :1720
  

  The variable is_employed is missing for over a third of the data. Why? Is employment status unknown? Did the company start collecting employment data only recently? Does NA mean “not in the active workforce” (for example, students or stay-at-home parents)?

  The variables housing_type, recent_move, num_vehicles and gas_usage are missing relatively few values—about 2% of the data. It’s probably safe to just drop the rows that are missing values, especially if the missing values are all in the same 1720 rows.

If a particular data field is largely unpopulated, it’s worth trying to determine why; sometimes the fact that a value is missing is informative in and of itself. For example, why is the is_employed variable missing many values? As we noted in listing 2, there are many possible reasons.

Whatever the reason for missing data, you must decide on the most appropriate action. Do you include a variable with missing values in your model? If you decide to include it, do you drop all the rows where this field is missing, or do you convert the missing values to 0 or to an additional category? In this example, you might decide to drop the data rows where that exclude data about housing or vehicles, because there aren’t many of them. You probably don’t want to throw out the data where you’re missing employment information, because employment status is probably highly predictive of having health insurance; you might instead treat the NAs as a third employment category. You’ll likely encounter missing values when model scoring, and you should deal with them during model training.

That’s all for now. If you want to know more about the book, check it out on liveBook here and see this slide deck.

[1] We have a copy of this synthetic dataset available for download from https://github.com/WinVector/PDSwR2/tree/master/Custdata,  and once saved, you can load it into R with the command  customer_data <- readRDS("custdata.RDS"). This data set is derived from census data. We introduced a little noise to the age variable to reflect what’s typically seen in real-world noisy data sets. We also included columns which are not necessarily relevant to our example scenario, but which exhibit some important data anomalies.

[2] Categorical variables are of class factor in R. They can be represented as strings (class character), and some analytical functions automatically convert string variables to factor variables. To get a useful summary of a categorical variable, it needs to be a factor.