
From Pandas Workout by Reuven Lerner

This article discusses cleaning data to use with Pandas.


Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com.


In the late 1980s, I worked for a company that wanted to know how much rain had fallen in a large number of US cities. Their solution? They gave me a list of cities and phone numbers, and asked me to call each one in sequence, recording the previous day’s rainfall in an Excel spreadsheet. Nowadays, getting that sort of information, and many other types of information, is pretty easy. Not only do many governments provide data sets for free, but numerous companies make data available for a price. No matter what topic you’re researching, data is almost certainly available. The only questions are where you can get it, how much it’ll cost, and what format the data comes in.

Actually, you should ask another question, too: How accurate is the data you’re using?

All too often, we assume that if we’re downloading a CSV file from an official-looking Web site, the data it contains is good. But the data that we download frequently has problems. That shouldn’t surprise us, given that the data comes from people (who make some kinds of mistakes) and machines (which make other kinds of mistakes). Maybe someone accidentally misnamed a file, or entered data into the wrong field. Maybe the automatic sensors used in collecting the data were broken, or offline. Maybe the servers were down for a day, or someone misconfigured the XML feed-reading system, or the routers were being rebooted, or a backhoe cut the Internet line.

All of this assumes that there was actually data to begin with. Often we’ll have missing data because there wasn’t any data to record.

This is why I’ve often heard data scientists say that 80 percent of their job involves cleaning data. What does it mean to “clean data”? Here is a partial list, followed by a short sketch that applies several of these operations:

  • rename columns
  • rename the index
  • remove irrelevant columns
  • split one column into two
  • combine two or more columns into one
  • remove non-data rows
  • remove repeated rows
  • remove rows with missing data (aka NaN)
  • replace NaN data with a single value
  • replace NaN data via interpolation
  • standardize strings
  • fix typos in strings
  • remove whitespace from strings
  • correct the types used for columns
  • identify and remove outliers
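
To make that list concrete, here is a minimal sketch that applies a handful of these operations to a small, invented data frame (the column names and values are made up for illustration):

 import numpy as np
 import pandas as pd

 # A small, made-up data frame with several of the problems listed above
 df = pd.DataFrame({'First Name ': ['Alice', 'Bob', 'Bob', None],
                    'temp': ['22', '19', '19', '25'],
                    'notes': ['ok', 'ok', 'ok', 'sensor down']})

 df = df.rename(columns={'First Name ': 'first_name'})  # rename a column
 df = df.drop(columns=['notes'])                        # remove an irrelevant column
 df = df.drop_duplicates()                              # remove repeated rows
 df = df.dropna()                                       # remove rows with missing data
 df['temp'] = df['temp'].astype(np.int64)               # correct the column's dtype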

The importance of cleaning your data, and thus ensuring that your analysis is as accurate as possible, cannot be overstated.

In this article, we’ll thus look at a few pandas techniques for cleaning our data, starting with several ways in which we can handle NaN values.

Useful references

Table 1. What you need to know

| Concept | What is it? | Example |
| --- | --- | --- |
| `s.isnull` | Returns a boolean series indicating where there are null (typically NaN) values in the series `s` | `s.isnull()` |
| `df.isnull` | Returns a boolean data frame indicating where there are null (typically NaN) values in the data frame `df` | `df.isnull()` |
| `df.replace` | Replaces values in one or more columns with other values | `df.replace({'a': {'b': 'c'}})` |
| `s.map` | Applies a function to each element of a series, returning the result of that application on each element | `s.map(lambda x: x**2)` |
| `df.fillna` | Replaces NaN with other values | `df.fillna(10)` |
| `df.dropna` | Removes rows with NaN values | `df = df.dropna()` |
| `s.str` | Accessor for working with textual data | `df['colname'].str` |
| `df.sort_index` | Reorders the rows of a data frame based on the values in its index, in ascending order | `df = df.sort_index()` |
| `pd.read_excel` | Creates a data frame based on an Excel spreadsheet | `df = pd.read_excel('myfile.xlsx')` |
| `s.value_counts` | Returns a sorted (descending frequency) series counting how many times each value appears in `s` | `s.value_counts()` |
| `s.unique` | Returns an array with the unique (i.e., distinct) values in `s`, including NaN (if it occurs in `s`) | `s.unique()` |

How much is missing?

We’ve already seen, on a number of occasions, that data frames (and series) can contain NaN values. One question we often want to answer is: How many NaN values are there in a given column? Or, for that matter, in a data frame?

One solution is to calculate things yourself. There is a count method you can run on a series, which returns the number of non-null values in the series. That, combined with the shape of the series, can tell you how many NaN values there are:

 
 s.shape[0] - s.count() 
  

Returns an integer, the number of null elements

This is tedious and annoying. And besides, shouldn’t pandas provide us with a way to do this? Indeed it does, in the form of the isnull method. If you call isnull on a column, it returns a boolean series, one that has True where there is a NaN value, and False elsewhere. You can then apply the sum method to the series, which will return the number of True values, thanks to the fact that Python’s boolean values inherit from integers, and can be used in place of 1 (True) and 0 (False) when you need numbers:

 
 s.isnull().sum() 
  

Calculate the number of NaN values in s
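
Here is a minimal sketch showing both approaches, using a made-up series:

 import numpy as np
 import pandas as pd

 s = pd.Series([10, np.nan, 30, np.nan, 50])  # a series with two NaN values

 print(s.shape[0] - s.count())  # 2, total elements minus non-null elements
 print(s.isnull().sum())        # 2, each True counts as 1 and each False as 0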

If you run isnull on a data frame, then you will get a new data frame back, with True and False values indicating whether there is a null value in that particular row-column combination. And of course, then you can run sum on the resulting data frame, finding how many NaN values there are in each column:

 
 df.isnull().sum() 
  

Calculate the number of NaN values in each column
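
For example, given a small, invented data frame:

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({'a': [1, np.nan, 3],
                    'b': [np.nan, np.nan, 30],
                    'c': [100, 200, 300]})

 print(df.isnull().sum())  # a: 1, b: 2, c: 0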

Instead of summing the results of a call to isnull, you can also use the any and all methods, both of which return boolean values. When called with axis='columns', any returns True for each row in which at least one of the values is True, and all returns True for each row in which all of the values are True. You can thus do the following:

 
 df[~df.isnull().any(axis='columns')] 
  

Show only the rows without NaN
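
A minimal sketch, again with an invented data frame, in which only the first row is complete:

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({'a': [1, np.nan, 3],
                    'b': [10, 20, np.nan]})

 mask = df.isnull().any(axis='columns')  # True for each row containing at least one NaN
 print(df[~mask])                        # shows only the first row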

Finally, the df.info method displays a wealth of information about the data frame on which it is run, including the name and type of each column, a summary of how many columns there are of each type, and the estimated memory usage. If the data frame is small enough, then it’ll also show you how many non-null values there are in each column. However, this calculation can take some time, so df.info only performs it when the data frame is below a certain size threshold (set by the pd.options.display.max_info_rows and pd.options.display.max_info_columns options). If your data frame is above that threshold, then you’ll need to tell pandas explicitly to count, by passing show_counts=True:

 
 df.info(show_counts=True) 
  

Get full information about the data frame df, including the number of non-null values in each column

pandas defines both isna and isnull for both series and data frames. What’s the difference between them? Actually, there is no difference. If you look at the pandas documentation, you’ll find that they’re identical except for the name of the method being called. I mostly use isnull, but if you prefer to go with isna, then be my guest.
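
You can verify the equivalence yourself on any series:

 import numpy as np
 import pandas as pd

 s = pd.Series([1, np.nan, 3])

 # isna and isnull return identical results
 print((s.isna() == s.isnull()).all())  # True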

Note that both of these are different from np.isnan, a function defined in NumPy, the library on top of which pandas is built. I try to stick with the methods that pandas defines, which in my experience integrate better with the rest of the system.

Rather than using ~, which pandas uses to invert boolean series and data frames, you can often use the notnull method, which is defined for both series and data frames.
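
For example, these two expressions produce the same boolean series, and either one can serve as a mask:

 import numpy as np
 import pandas as pd

 s = pd.Series([1, np.nan, 3])

 print((~s.isnull()).equals(s.notnull()))  # True, the two are identical
 print(s[s.notnull()])                     # keeps only the non-null elements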

That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.