From R in Action, Third Edition by Robert Kabacoff
This article discusses graphs and graphics using ggplot2.
Take 37% off R in Action, Third Edition by entering fcckabacoff3 into the discount code box at checkout at manning.com.
On many occasions, I’ve presented clients with carefully crafted statistical results in the form of numbers and text, only to have their eyes glaze over as the chirping of crickets permeated the room. Yet those same clients had enthusiastic “Ah-ha!” moments when I presented the same information to them in the form of graphs. Often I can see patterns in data or detect anomalies in data values by looking at graphs—patterns or anomalies that I completely missed when conducting more formal statistical analyses.
Human beings are remarkably adept at discerning relationships from visual representations. A well-crafted graph can help you make meaningful comparisons among thousands of pieces of information, extracting patterns not easily found through other methods. This is one reason why advances in the field of statistical graphics have had such a major impact on data analysis. Data analysts need to look at their data, and this is one area where R shines.
The R language has grown organically over the years, through the contributions of many independent software developers. This has led to the creation of four distinct approaches to graph creation in R: base, lattice, ggplot2, and grid graphics. In this article we’ll focus on ggplot2, the most powerful and popular approach currently available in R.
The ggplot2 package, written by Hadley Wickham (2009a), provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham (2009b). The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations.
This chapter walks you through the major concepts and functions used to create ggplot2 graphs by using visualizations to address the following questions:
- What’s the relationship between a worker’s past experience and their salary?
- How can we summarize this relationship simply?
- Is this relationship different for men and women?
- Does it matter what industry the worker is in?
We’ll start with a simple scatterplot displaying the relationship between workers’ experience and wages. Then in each section, we’ll add new features until we’ve produced a single publication quality plot that addresses these questions. At each step, we’ll hopefully gain greater insight into the questions presented.
To answer these questions, we’ll use the CPS85 data frame contained in the mosaicData package. The data frame contains a random sample of 534 individuals selected from the 1985 Current Population Survey, and includes information on their wages, demographics, and work experience. Be sure to install both the mosaicData and ggplot2 packages before continuing (install.packages(c("mosaicData", "ggplot2"))).
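Before plotting, it can help to glance at the data. The following is a minimal sketch, assuming the mosaicData package is installed. It is not part of the original walkthrough and uses only base R functions to inspect the data frame.

```r
library(mosaicData)

# Load the CPS85 data frame that ships with mosaicData
data(CPS85, package = "mosaicData")

# Variable names and types
str(CPS85)

# Summaries of the variables used in this article
summary(CPS85[, c("wage", "exper", "sex", "sector")])
```

Here wage, exper, sex, and sector are the columns that the plots in this article map to axes, colors, and facets.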
Creating a graph with ggplot2
The ggplot2 package uses a series of functions to build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time. By default, ggplot2 graphs appear on a grey background with white reference lines. We’ll start by setting the default theme to a white background with light grey reference lines. This looks better when printed in black and white. Let’s load the ggplot2 package and set this default theme.
library(ggplot2)
theme_set(theme_bw())
ggplot
The first function in building a graph is the ggplot() function. It specifies the following:
- the data frame containing the data to be plotted
- the mapping of the variables to visual properties of the graph. The mappings are placed in an aes() function (which stands for aesthetics or “something you can see”).
The code below produces the graph in figure 1.
library(ggplot2)
library(mosaicData)
ggplot(data = CPS85, mapping = aes(x = exper, y = wage))
Figure 1. Mapping worker experience and wages to the x- and y-axes
Why is the graph empty? We specified that the exper variable should be mapped to the x-axis and that the wage variable should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph. In this case, we want points to represent each participant.
Geoms
Geoms are the geometric objects (points, lines, bars, and shaded regions) which can be placed on a graph. They’re added using functions that start with the phrase geom_. Currently, thirty-seven different geoms are available and the list is growing. Table 1 describes the more common geoms, along with frequently used options for each.
Table 1. Geom functions
Function | Adds | Options
geom_bar() | Bar chart |
geom_boxplot() | Box plot |
geom_density() | Density plot |
geom_histogram() | Histogram |
geom_hline() | Horizontal lines |
geom_jitter() | Jittered points |
geom_line() | Line graph |
geom_point() | Scatterplot |
geom_rug() | Rug plot |
geom_smooth() | Fitted line |
geom_text() | Text annotations | Many; see the help for this function
geom_violin() | Violin plot |
geom_vline() | Vertical lines |
We’ll add points using the geom_point() function, creating a scatterplot. In ggplot2 graphs, functions are chained together using the + sign to build a final plot.
library(ggplot2)
library(mosaicData)
ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point()
The results can be seen in figure 2.
Figure 2. Scatterplot of worker experience vs. wages
It appears that as experience goes up, wages go up, but the relationship is weak. The graph also indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case and reproduce the plot.
CPS85 <- CPS85[CPS85$wage < 40, ]
ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point()
The new graph is displayed in figure 3.
Figure 3. Scatterplot of worker experience vs. wages with outlier removed
A number of options can be specified in a geom_ function (see table 1). Options for geom_point() include color, size, shape, and alpha. These control the point color, size, shape, and transparency, respectively. Colors can be specified by name or hexadecimal code. Shape and linetype can be specified by the name or number representing the symbol or pattern, respectively. Point size is specified with positive real numbers starting at zero. Larger numbers produce larger points. Transparency ranges from 0 (completely transparent) to 1 (completely opaque). Adding a degree of transparency can help visualize overlapping points.
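As an illustrative sketch of these options, the following sets a hexadecimal color, a numbered shape, a size, and partial transparency as constants. The specific values are arbitrary choices for demonstration, not taken from the book.

```r
library(ggplot2)
library(mosaicData)

# Constant (non-mapped) options are passed directly to the geom,
# outside of aes()
p_opts <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "#6495ED",  # hexadecimal code for cornflowerblue
             shape = 17,         # shape number 17 is a filled triangle
             size = 2,           # point size
             alpha = 0.5)        # halfway between transparent and opaque
p_opts
```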
Let’s make the points in figure 3 larger, semi-transparent, and blue. The code below produces the graph in figure 4.
ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3)
The white background comes from the theme_bw() default we set earlier. I might argue that the chart is more attractive (at least if you have color output), but it doesn’t add to our insights. It would be helpful if the graph had a line summarizing the trend between experience and wages.
Figure 4. Scatterplot of worker experience vs. wages with outlier removed with modified point color, transparency, and point size
We can add this line with the geom_smooth() function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression line (method = "lm", where lm stands for linear model).
ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7, size = 3) +
  geom_smooth(method = "lm")
The results are given in figure 5.
Figure 5. Scatterplot of worker experience vs. wages with a line of best fit
We can see from this line that on average, wages appear to increase to a moderate degree with experience. We are only using two geoms in this example.
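The quadratic and nonparametric line types mentioned above can be requested in a similar way. This is a hedged sketch rather than the author’s code: it assumes a quadratic fit is expressed through the formula argument with poly(), and that loess is an acceptable nonparametric smoother here. The cps object is a local, hypothetical copy with the outlier removed.

```r
library(ggplot2)
library(mosaicData)

# Local copy without the high-wage outlier (does not modify CPS85)
cps <- CPS85[CPS85$wage < 40, ]

base_plot <- ggplot(data = cps, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7)

# Quadratic fit: a linear model with a second-degree polynomial in x
p_quad <- base_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2))

# Nonparametric fit: a loess smoother, drawn without a confidence band
p_loess <- base_plot +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

p_quad
p_loess
```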
Grouping
In the previous section, we set graph characteristics such as color and transparency to a constant value. We can also map variable values to the color, shape, size, transparency, line style, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph (a process called grouping).
Let’s add sex to the plot and represent it by color, shape, and linetype.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage,
                     color = sex, shape = sex, linetype = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, size = 1.5)
By default, the first group (female) is represented by pink filled circles and a solid pink line, and the second group (male) is represented by teal filled triangles and a dashed teal line. The new graph is presented in figure 6.
Figure 6. Scatterplot of worker experience vs. wages with points colored by sex and separate line of best fit for men and women.
Note that the color=sex, shape=sex, and linetype=sex options are placed in the aes() function because we’re mapping a variable to an aesthetic. The geom_smooth() option se = FALSE was added to suppress the confidence intervals, making the graph less busy and easier to read. The size = 1.5 option makes the line a bit thicker.
It now appears that men tend to make more money than women (higher line). Additionally, there may be a stronger relationship between experience and wages for men than for women (steeper line).
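The difference between setting a constant and mapping a variable can be seen side by side. The sketch below is illustrative only, and the object names p_set and p_map are hypothetical.

```r
library(ggplot2)
library(mosaicData)

# Setting: color is a constant, so it goes outside aes().
# Every point is drawn in the same color and no legend is created.
p_set <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue", alpha = .7)

# Mapping: color depends on a variable, so it goes inside aes().
# Each level of sex gets its own color and a legend is added.
p_map <- ggplot(data = CPS85,
                mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7)

p_set
p_map
```

A common slip is to write aes(color = "cornflowerblue"). That treats the string as a one-level variable, so ggplot2 assigns it a default palette color and a legend instead of drawing cornflower blue points.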
Scales
As we’ve seen, the aes() function is used to map variables to the visual characteristics of a plot. Scales specify how each of these mappings occurs. For example, ggplot2 automatically creates plot axes with tick marks, tick mark labels, and axis labels. Often they look fine, but occasionally you’ll want to take greater control over their appearance. Colors that represent groups are chosen automatically, but you may want to select a different set of colors based on your tastes or a publication’s requirements.
Scale functions (which start with scale_) allow you to modify the default scaling. Some common scaling functions are listed in table 2.
Table 2. Some common scale functions
Function | Description
scale_x_continuous(), scale_y_continuous() | Scales the x and y axes for quantitative variables. Options include …
scale_x_discrete(), scale_y_discrete() | Same as above for axes representing categorical variables.
scale_color_manual() | Specifies the colors used to represent the levels of a categorical variable. The …
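The examples in this article only scale quantitative axes. As a minimal sketch of a categorical axis scale, the following relabels the levels of sex on the x-axis of a box plot. It assumes the levels are coded F and M, as in the mosaicData documentation, and the replacement labels are illustrative choices.

```r
library(ggplot2)
library(mosaicData)

# Box plot of wages by sex, with custom labels on the categorical x-axis
p_box <- ggplot(data = CPS85, mapping = aes(x = sex, y = wage)) +
  geom_boxplot() +
  # Named vector: names are the existing levels, values are the new labels
  scale_x_discrete(labels = c("F" = "Female", "M" = "Male")) +
  scale_y_continuous(breaks = seq(0, 45, 5))
p_box
```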
In the next plot, we’ll change the x- and y-axis scaling, and the colors representing males and females. The x-axis representing exper ranges from 0 to 60 by 10, and the y-axis representing wage ranges from 0 to 30 by 5. Females are coded with an off-red color and males are coded with an off-blue color. The code below produces the graph in figure 7.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage,
                     color = sex, shape = sex, linetype = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, size = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5)) +
  scale_color_manual(values = c("indianred3", "cornflowerblue"))
Figure 7. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex.
The numbers on the x- and y-axes are better, and the colors are more attractive (IMHO), but wages are in dollars. We can change the labels on the y-axis to represent dollars using the scales package. The scales package provides label formatting for dollars, euros, percents, and more.
Install the scales package (install.packages("scales")) and then run the following code.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage,
                     color = sex, shape = sex, linetype = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, size = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", "cornflowerblue"))
The results are provided in figure 8.
Figure 8. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex. Wages are printed in dollar format.
We’re definitely getting there. Here’s the next question. Is the relationship between experience, wages and sex the same for each job sector? Let’s repeat this graph once for each job sector in order to explore this.
Facets
Sometimes relationships are clearer if groups appear in side-by-side graphs rather than overlapping in a single graph. Facets reproduce a graph for each level of a given variable (or combination of variables). You can create faceted graphs using the facet_wrap() and facet_grid() functions. The syntax is given in table 3, where var, rowvar, and colvar are factors.
Table 3. ggplot2 facet functions
Syntax | Results
facet_wrap(~var, …) | Separate plots for each level of …
facet_wrap(~var, …) | Separate plots for each level of …
facet_grid(rowvar ~ colvar) | Separate plots for each combination of rowvar and colvar, where …
facet_grid(rowvar ~ .) | Separate plots for each level of rowvar, arranged as a single column
facet_grid(. ~ colvar) | Separate plots for each level of colvar, arranged as a single row
Here, facets are defined by the eight levels of the sector variable. Because each facet is smaller than a one-panel graph alone, we’ll omit size=3 from geom_point() and size=1.5 from geom_smooth(). This reduces the point and line sizes compared with the previous graphs and looks better in a faceted graph. The code below produces figure 9.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage,
                     color = sex, shape = sex, linetype = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", "cornflowerblue")) +
  facet_wrap(~sector)
Figure 9. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex. Separate graphs (facets) are provided for each of 8 job sectors.
It appears that the differences between men and women depend on the job sector under consideration. For example, there’s a strong positive relationship between experience and wages for male managers, but not for female managers. To a lesser extent, this is also true for sales workers. There appears to be no relationship between experience and wages for either male or female service workers, although males make slightly more. Wages go up with experience for female clerical workers, but may go down for male clerical workers (the relationship may not be significant here). We’ve gained a great deal of insight into the relationship between wages and experience at this point.
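The facet_grid() function is mentioned above but not used in the walkthrough. As a rough sketch of how it differs from facet_wrap(), the following arranges panels with sex in rows and sector in columns. The choice of row and column variables is an illustrative assumption, not the author’s.

```r
library(ggplot2)
library(mosaicData)

# One panel per combination of sex (rows) and sector (columns)
p_grid <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_grid(sex ~ sector)
p_grid
```

With eight sectors the panels become narrow, which is one reason a wrapped layout can be the more readable choice for this data.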
Labels
Graphs should be easy to interpret and informative labels are a key element in achieving this goal. The labs() function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added. Let’s modify each in the following code.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage,
                     color = sex, shape = sex, linetype = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender", shape = "Gender", linetype = "Gender")
The graph is provided in figure 10.
Figure 10. Scatterplot of worker experience vs. wages with separate graphs (facets) for each of eight job sectors and custom titles and labels.
Now a viewer doesn’t need to guess what the labels exper and wage mean, or where the data come from.
Themes
Finally, we can fine-tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data related features of the graph. Let’s use a cleaner, more minimalistic theme this time. The code below produces the graph in figure 11.
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3", "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()
Figure 11. Scatterplot of worker experience vs. wages with separate graphs (facets) for each of eight job sectors and custom titles and labels, and a cleaner theme.
This is our finished graph, ready for publication. These findings are tentative. They’re based on a limited sample size and don’t involve statistical testing to assess whether differences may be due to chance variation.
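Beyond choosing a complete theme, individual elements such as legend placement can be adjusted with the theme() function. The following is a small sketch under that assumption. The particular settings (legend at the bottom, bold title) are illustrative and not part of the finished graph above.

```r
library(ggplot2)
library(mosaicData)

p_theme <- ggplot(data = CPS85,
                  mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .6) +
  labs(title = "Relationship between wages and experience",
       color = "Gender") +
  theme_minimal() +
  # theme() tweaks individual elements; add it after the complete theme,
  # since a complete theme replaces any earlier settings
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))
p_theme
```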
That’s all for this article.
If you want to learn more about the book, check it out on our browser-based liveBook reader here.