|
From Pandas Workout by Reuven Lerner This article discusses using multi indexes in Pandas. |
Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com
State SAT scores (learning goal: setting and using a multi-index)
Setting an index can make it easier for us to create queries about our data. But sometimes our data is hierarchical in nature. That’s where the pandas
concept of a “multi-index” comes into play. With a multi-index, you can set the index not just to be a single column, but multiple columns. Imagine, for example, a data frame containing sales data: You might want to have sales broken down by year, and then further broken down by region. Once you use the phrase “further broken down by,” a multi-index is almost certainly a good idea.
In this article, we’ll look at a summary of scores from the SAT, a standardized university-admissions test widely used in the United States, as compared with the test taker’s university grade point average (also known as GPA, where 4.0 is the highest possible score) — although we’ll ignore the GPA for now. The CSV file (sat-scores.csv
) has 99 columns and 577 rows, describing all 50 US states and three non-states (Puerto Rico, the Virgin Islands, and Washington, DC), from 2005 through 2015.
In this exercise, I want you to:
- Read in the scores file, only keeping the
Year
,State.Code
,Total.Math
,Total.Test-takers
, andTotal.Verbal
columns. - Create a multi-index based on the year and the two-letter state code.
- How many people took the SAT, total, in 2005?
- What was the average SAT math score in 2010 from New York (
NY
), New Jersey (NJ
), Massachusetts (MA
), and Illinois (IL
)? - What was the average SAT verbal score in 2012-2015 from Arizona (
AZ
), California (CA
), and Texas (TX
)?
Discussion
This article will show the power and flexibility of a multi-index. For starters, you need to load a CSV file and create a multi-index based on the “Year” and “State.Code” columns. We could do this in two stages, first reading the file, including the columns that we wanted, into a data frame, and then choosing two columns to serve as our index:
filename = '../data/sat-scores.csv' df = pd.read_csv(filename, usecols=['Year', 'State.Code', 'Total.Math', 'Total.Test-takers', 'Total.Verbal']) df = df.set_index(['Year', 'State.Code'])
Notice that, as always, the result of set_index
is a new data frame, one which we assign back to df
.
However, you might remember that read_csv
also has a index_col
parameter. If we pass an argument to that parameter, then we can tell read_csv
to do it all in one step — reading in the data frame, and setting the index to be the column that we request. However, it turns out that we can pass a list of columns as the argument to index_col
, thus creating the multi-index as the data frame is collected. For example:
filename = '../data/sat-scores.csv' df = pd.read_csv(filename, usecols=['Year', 'State.Code', 'Total.Math', 'Total.Test-takers', 'Total.Verbal'], index_col=['Year', 'State.Code'])
Now that we have loaded our data frame, we can start to explore our data and answer some questions.
First, I wanted to find out the mean math score for students in four states — New York, New Jersey, Massachusetts, and Illinois, in the year 2010. As usual, we’ll want to use loc
to retrieve the data that’s of interest to us. But we’ll need to combine three things to create the right query:
- From the first part (
Year
) of the multi-index, we only want 2010. - From the second part (
State.Code
) of the multi-index, we only wantNY
,NJ
,MA
, andIL
. - From the columns, we are interested in
Total.Math
.
Remember that when we’re retrieving from a multi-index, we need to put the parts together inside of a tuple. Moreover, we can indicate that we want more than one value by using a list. The result is:
df.loc[(2010, ['NY', 'NJ', 'MA', 'IL']), 'Total.Math'].mean()
The above query retrieves rows with a year of 2010, and coming from any of those four states. We only get the Total.Math
column, on which we then calculate the mean.
The next question asks for a similar calculation, but on several years, as well as several states. Once again, that’s not an issue, if we think carefully about how to construct the query:
- From the first part (
Year
) of the multi-index, we want 2012, 2013, 2014, and 2015. - From the second part (
State.Code
) of the multi-index, we wantAZ
,CA
, andTX
. - From the columns, we are again interested in
Total.Math
.
The query then becomes:
df.loc[([2012,2013,2014,2015], ['AZ', 'CA', 'TX']), 'Total.Math'].mean()
Notice how pandas
figures out how to combine the parts of our multi-index, such that we get only the rows that match both parts.
Solution
filename = '../data/sat-scores.csv' df = pd.read_csv(filename, usecols=['Year', 'State.Code', 'Total.Math', 'Total.Test-takers', 'Total.Verbal']) df = df.set_index(['Year', 'State.Code']) ❶ df.loc[2005, 'Total.Test-takers'].sum() ❷ df.loc[(2010, ['NY', 'NJ', 'MA', 'IL']), 'Total.Math'].mean() ❸ df.loc[([2012,2013,2014,2015], ['AZ', 'CA', 'TX']), 'Total.Math'].mean()❹
❶ Set the index to be a combination of Year
and State.code
❷ Retrieve rows with 2005, and the column Total.Test-takers
, then sum those values
❸ Retrieve rows with 2010 and any of those four states, and the column Total.Math
, then get the average
❹ Retrieve rows from 2012-2015 with those three states, and the column Total.Math
, then get the average
Beyond the exercise
- What were the average math and verbal scores for Florida, Indiana, and Idaho across all years? (You don’t break out the values by state.)
- Which state received the highest verbal score, and in which year?
- Was the average math score in 2005 higher or lower than that in 2015?
Sorting by index
When we talk about sorting data in pandas
, we’re usually referring to sorting the data. For example, I might want to have the rows in my data frame sorted by price or by regional sales code.
But pandas
lets us sort our data frames in an additional way: Based on the index values. We can do that with the method sort_index
, which like so many other methods returns a new data frame with the same content, but whose rows are sorted based on index values. We can thus say:
df = df.sort_index()
If your data frame contains a multi-index, then the sorting will be done primarily along the first level, then along the second level, and so forth.
In addition to having some aesthetic benefits, sorting a data frame by index can make certain tasks easier, or even possible. For example, if you try to select a slice of rows, pandas
insists that the index will be sorted, in order to avoid the ambiguity.
If your data frame is unsorted and has a multi-index, then performing some operations might result in a warning:
PerformanceWarning: indexing past lexsort depth may impact performance
This is pandas
trying to tell you that the combination of large size, multi-index, and an unsorted index are likely to cause you trouble. You can avoid the warning by sorting your data frame via its index.
If you want to check whether a data frame is sorted, you can check this attribute:
df.index.is_monotonic_increasing
Note that this is not a method, but is rather a boolean value. Also note that some older documentation and blogs mentions the method is_lexsorted
, which has been deprecated in recent versions of pandas
; you should instead be using is_monotonic_increasing
.
That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.