From R in Action, Third Edition by Robert Kabacoff

This article discusses creating graphs with the ggplot2 package.




On many occasions, I’ve presented clients with carefully crafted statistical results in the form of numbers and text, only to have their eyes glaze over as the chirping of crickets permeated the room. Yet those same clients had enthusiastic “Ah-ha!” moments when I presented the same information to them in the form of graphs. Often I can see patterns in data or detect anomalies in data values by looking at graphs—patterns or anomalies that I completely missed when conducting more formal statistical analyses.

Human beings are remarkably adept at discerning relationships from visual representations. A well-crafted graph can help you make meaningful comparisons among thousands of pieces of information, extracting patterns not easily found through other methods. This is one reason why advances in the field of statistical graphics have had such a major impact on data analysis. Data analysts need to look at their data, and this is one area where R shines.

The R language has grown organically over the years, through the contributions of many independent software developers. This has led to the creation of four distinct approaches to graph creation in R – base, lattice, ggplot2, and grid graphics. In this article we’ll focus on ggplot2, the most powerful and popular approach currently available in R.

The ggplot2 package, written by Hadley Wickham (2009a), provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham (2009b). The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations.

This article walks you through the major concepts and functions used to create ggplot2 graphs by using visualizations to address the following questions:

  • What’s the relationship between a worker’s past experience and their salary?
  • How can we summarize this relationship simply?
  • Is this relationship different for men and women?
  • Does it matter what industry the worker is in?

We’ll start with a simple scatterplot displaying the relationship between workers’ experience and wages. Then in each section, we’ll add new features until we’ve produced a single publication-quality plot that addresses these questions. At each step, we should gain greater insight into the questions presented.

To answer these questions, we’ll use the CPS85 data frame contained in the mosaicData package. The data frame contains a random sample of 534 individuals selected from the 1985 Current Population Survey, and includes information on their wages, demographics, and work experience. Be sure to install both the mosaicData and ggplot2 packages before continuing (install.packages(c("mosaicData", "ggplot2"))).
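
Before plotting, it can help to take a quick look at the variables used in this article. The sketch below is only an illustration: it assumes the mosaicData package is installed and that the data frame contains the columns wage, exper, sex, and sector, which are the ones used in the plots that follow.

```r
# Load the data and inspect the variables used in this article
library(mosaicData)
data(CPS85)

# Number of rows (individuals) and columns (variables)
dim(CPS85)

# Structure of the four variables plotted later
vars <- c("wage", "exper", "sex", "sector")
str(CPS85[, vars])

# Summary statistics for hourly wages
summary(CPS85$wage)
```

Checking the range of wage at this stage makes it easier to spot unusual values before they show up in a plot.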

Creating a graph with ggplot2

The ggplot2 package uses a series of functions to build up a graph in layers. We’ll build a complex graph by starting with a simple graph and adding additional elements, one at a time. By default, ggplot2 graphs appear on a grey background with white reference lines. We’ll start by setting the default theme to a white background with light grey reference lines. This looks better when printed in black and white. Let’s load the ggplot2 package and set this default theme.

 
 library(ggplot2)
 theme_set(theme_bw())
  

ggplot

The first function in building a graph is the ggplot() function. It specifies the following:

  • the data frame containing the data to be plotted
  • the mapping of the variables to visual properties of the graph. The mappings are placed in an aes() function (which stands for aesthetics or “something you can see”).

The code below produces the graph in figure 1.

 
 library(ggplot2)
 library(mosaicData)
 ggplot(data = CPS85, mapping = aes(x = exper, y = wage))
  

Figure 1. Mapping worker experience and wages to the x- and y-axes


Why is the graph empty? We specified that the exper variable should be mapped to the x-axis and that the wage variable should be mapped to the y-axis, but we haven’t yet specified what we wanted placed on the graph. In this case, we want points to represent each participant.

Geoms

Geoms are the geometric objects (points, lines, bars, and shaded regions) which can be placed on a graph. They’re added using functions that start with the phrase geom_. Currently, thirty-seven different geoms are available and the list is growing. Table 1 describes the more common geoms, along with frequently used options for each.

Table 1. Geom functions

Function            Adds              Options
geom_bar()          Bar chart         color, fill, alpha
geom_boxplot()      Box plot          color, fill, alpha, notch, width
geom_density()      Density plot      color, fill, alpha, linetype
geom_histogram()    Histogram         color, fill, alpha, linetype, binwidth
geom_hline()        Horizontal lines  color, alpha, linetype, size
geom_jitter()       Jittered points   color, size, alpha, shape
geom_line()         Line graph        color, alpha, linetype, size
geom_point()        Scatterplot       color, alpha, shape, size
geom_rug()          Rug plot          color, sides
geom_smooth()       Fitted line       method, formula, color, fill, linetype, size
geom_text()         Text annotations  Many; see the help for this function
geom_violin()       Violin plot       color, fill, alpha, linetype
geom_vline()        Vertical lines    color, alpha, linetype, size
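
Because the set of geoms changes between releases, one way to see what your installed version provides is to list the exported functions whose names start with geom_. This is a small sketch using base R’s ls(); the variable name geoms is just for illustration, and the count you get depends on your ggplot2 version.

```r
library(ggplot2)

# List every exported ggplot2 function whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")

# The count depends on the installed ggplot2 version
length(geoms)

# Show the first few names
head(geoms)
```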

We’ll add points using the geom_point() function, creating a scatterplot. In ggplot2 graphs, functions are chained together using the + sign to build a final plot.

 
 library(ggplot2)
 library(mosaicData)
 ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
   geom_point()
  

The results can be seen in figure 2.


Figure 2. Scatterplot of worker experience vs. wages
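
To make the idea of layers concrete, the following sketch stores the plot in an object and adds the point layer in a second step. The object names p and p_points are illustrative, and inspecting the layers element is shown only to illustrate how the plot is built up; it isn’t something you normally need to do.

```r
library(ggplot2)
library(mosaicData)

# Step 1: data and aesthetic mappings only (an empty plot, as in figure 1)
p <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage))
length(p$layers)          # no layers have been added yet

# Step 2: add a point layer; the original object p is not modified
p_points <- p + geom_point()
length(p_points$layers)   # one layer: the points

# Printing the object draws the plot
print(p_points)
```

Storing intermediate plots this way lets you reuse the same base object with different geoms.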


It appears that as experience goes up, wages go up, but the relationship is weak. The graph also indicates that there is an outlier. One individual has a wage much higher than the rest. We’ll delete this case and reproduce the plot.

 
 CPS85 <- CPS85[CPS85$wage < 40, ]
 ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
   geom_point()
  

The new graph is displayed in figure 3.


Figure 3. Scatterplot of worker experience vs. wages with outlier removed
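
If you’d like to look at the outlying case before dropping it, something like the following sketch may help. It works on a copy taken directly from the package, so it doesn’t depend on whether CPS85 has already been filtered in your session. The names cps_all and cps_trimmed are illustrative.

```r
library(mosaicData)

# Start from a fresh copy so the original data frame stays untouched
cps_all <- mosaicData::CPS85

# Look at the row with the highest wage
cps_all[which.max(cps_all$wage), c("wage", "exper", "sex", "sector")]

# Keep only rows with a wage below 40, the same filter used in the article
cps_trimmed <- subset(cps_all, wage < 40)

# How many rows were dropped?
nrow(cps_all) - nrow(cps_trimmed)
```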


A number of options can be specified in a geom_ function (see table 1). Options for geom_point() include color, size, shape, and alpha. These control the point color, size, shape, and transparency, respectively. Colors can be specified by name or hexadecimal code. Shape and linetype can be specified by the name or number representing the symbol or pattern, respectively. Point size is specified with non-negative real numbers; larger numbers produce larger points. Transparency ranges from 0 (completely transparent) to 1 (completely opaque). Adding a degree of transparency can help visualize overlapping points.
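
As a quick illustration of these options, the sketch below specifies the color by hexadecimal code and the shape by number. The particular values are arbitrary choices for demonstration, and p_opts is an illustrative name.

```r
library(ggplot2)
library(mosaicData)

# A color name and its hexadecimal code give the same RGB values
col2rgb("cornflowerblue")
col2rgb("#6495ED")

# Constant (unmapped) values are set outside aes()
p_opts <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "#6495ED",  # hexadecimal color code
             shape = 17,         # filled triangle
             size = 2,           # point size
             alpha = 0.5)        # halfway between transparent and opaque
p_opts
```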

Let’s make the points in figure 3 larger, semi-transparent, and blue. The code below produces the graph in figure 4.

 
 ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
   geom_point(color = "cornflowerblue", alpha = .7, size = 3)
  

The background is white rather than gray because we set theme_bw() as the default theme earlier. I might argue that the chart is more attractive (at least if you have color output), but it doesn’t add to our insights. It would be helpful if the graph had a line summarizing the trend between experience and wages.


Figure 4. Scatterplot of worker experience vs. wages with outlier removed with modified point color, transparency, and point size


We can add this line with the geom_smooth() function. Options control the type of line (linear, quadratic, nonparametric), the thickness of the line, the line’s color, and the presence or absence of a confidence interval. Here we request a linear regression line (method = "lm", where lm stands for linear model).

 
 ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
   geom_point(color = "cornflowerblue", alpha = .7, size = 3) +
   geom_smooth(method = "lm")
  

The results are given in figure 5.


Figure 5. Scatterplot of worker experience vs. wages with a line of best fit


We can see from this line that on average, wages appear to increase to a moderate degree with experience. We are only using two geoms in this example.
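
The other line types mentioned above can be requested in a similar way. The following is a sketch rather than part of the original analysis: it overlays a quadratic fit (a linear model with a second-degree polynomial term) and a nonparametric loess smoother. The colors and the name p_fits are arbitrary.

```r
library(ggplot2)
library(mosaicData)

p_fits <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(color = "grey60", alpha = .5) +
  # Quadratic fit: a linear model with a second-degree polynomial in x
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              se = FALSE, color = "indianred3") +
  # Nonparametric fit: a loess smoother
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, color = "cornflowerblue", linetype = "dashed")
p_fits
```

Comparing the two curves against the straight line in figure 5 is one informal way to judge whether a linear summary is reasonable.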

Grouping

In the previous section, we set graph characteristics such as color and transparency to a constant value. We can also map variable values to the color, shape, size, transparency, line style, and other visual characteristics of geometric objects. This allows groups of observations to be superimposed in a single graph (a process called grouping).

Let’s add sex to the plot and represent it by color, shape, and linetype.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage,
                      color = sex, shape = sex, linetype = sex)) +
   geom_point(alpha = .7, size = 3) +
   geom_smooth(method = "lm", se = FALSE, size = 1.5)
  

By default, the first group (female) is represented by pink filled circles and a solid pink line, and the second group (male) is represented by teal filled triangles and a dashed teal line. The new graph is presented in figure 6.


Figure 6. Scatterplot of worker experience vs. wages with points colored by sex and separate line of best fit for men and women.


Note that the color=sex, shape=sex, and linetype=sex options are placed in the aes() function because we’re mapping a variable to an aesthetic. The geom_smooth option (se = FALSE) was added to suppress the confidence intervals, making the graph less busy and easier to read. The size = 1.5 option makes the line a bit thicker.

Simplifying Graphs

In general, our goal is to create graphs that are as simple as possible while conveying the information accurately. In the graphs in this article, I’d probably map gender to color alone. Adding mappings to shape and line type makes the graphs unnecessarily busy.

It now appears that men tend to make more money than women (higher line). Additionally, there may be a stronger relationship between experience and wages for men than for women (steeper line).
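
The simplification suggested in the sidebar, mapping sex to color alone, might look like the sketch below. The name p_simple is illustrative, and names() is called only to show which aesthetics are mapped (ggplot2 stores color under its British spelling, colour).

```r
library(ggplot2)
library(mosaicData)

# Map sex to color only; shape and linetype keep their defaults
p_simple <- ggplot(data = CPS85,
                   mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .7, size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Which aesthetics are mapped to variables?
names(p_simple$mapping)

p_simple
```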

Scales

As we’ve seen, the aes() function is used to map variables to the visual characteristics of a plot. Scales specify how each of these mappings occurs. For example, ggplot2 automatically creates plot axes with tick marks, tick mark labels, and axis labels. Often they look fine, but occasionally you’ll want to take greater control over their appearance. Colors that represent groups are chosen automatically, but you may want to select a different set of colors based on your tastes or a publication’s requirements.

Scale functions (which start with scale_) allow you to modify the default scaling. Some common scale functions are listed in table 2.

Table 2. Some common scale functions

Function                                      Description
scale_x_continuous(), scale_y_continuous()    Scales the x and y axes for quantitative variables. Options include breaks for specifying tick marks, labels for specifying tick mark labels, and limits to control the range of the values displayed.
scale_x_discrete(), scale_y_discrete()        Same as above for axes representing categorical variables.
scale_color_manual()                          Specifies the colors used to represent the levels of a categorical variable. The values option specifies the colors. A table of colors can be found at http://research.stowers.org/mcm/efg/R/Color/Chart/ColorChart.pdf

In the next plot, we’ll change the x- and y-axis scaling, and the colors representing males and females. The x-axis representing exper ranges from 0 to 60 by 10, and the y-axis representing wage ranges from 0 to 30 by 5. Females are coded with an off-red color and males are coded with an off-blue color. The code below produces the graph in figure 7.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage,
                      color = sex, shape=sex, linetype=sex)) +
    geom_point(alpha = .7, size = 3) +
    geom_smooth(method = "lm", se = FALSE, size = 1.5) +
    scale_x_continuous(breaks = seq(0, 60, 10)) +
    scale_y_continuous(breaks = seq(0, 30, 5)) +
    scale_color_manual(values = c("indianred3", "cornflowerblue"))
  

Figure 7. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex.


The numbers on the x- and y-axes are better, and the colors are more attractive (IMHO), but wages are in dollars. We can change the labels on the y-axis to represent dollars using the scales package. The scales package provides label formatting for dollars, euros, percents, and more.

Install the scales package (install.packages("scales")) and then run the following code.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage,
                      color = sex, shape=sex, linetype=sex)) +
    geom_point(alpha = .7, size = 3) +
    geom_smooth(method = "lm", se = FALSE, size = 1.5) +
    scale_x_continuous(breaks = seq(0, 60, 10)) +
    scale_y_continuous(breaks = seq(0, 30, 5),
                       label = scales::dollar) +
    scale_color_manual(values = c("indianred3", "cornflowerblue"))
  

The results are provided in figure 8.


Figure 8. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex. Wages are printed in dollar format.
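
The formatters in the scales package are ordinary functions that take numbers and return character labels, so you can try them on their own before using them in a plot. The sketch below assumes the scales package is installed; the variable names are illustrative.

```r
# Formatters take numeric values and return character labels
wage_labels <- scales::dollar(c(0, 5, 30))
wage_labels

# Percent and comma formats work the same way
pct_labels <- scales::percent(c(0.1, 0.25))
pct_labels

big_labels <- scales::comma(c(1000, 250000))
big_labels
```

Passing the function itself (without parentheses) to a scale, as in the code above, tells ggplot2 to apply it to each tick mark value.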


We’re definitely getting there. Here’s the next question. Is the relationship between experience, wages and sex the same for each job sector? Let’s repeat this graph once for each job sector in order to explore this.

Facets

Sometimes relationships are clearer if groups appear in side-by-side graphs rather than overlapping in a single graph. Facets reproduce a graph for each level of a given variable (or combination of variables). You can create faceted graphs using the facet_wrap() and facet_grid() functions. The syntax is given in table 3, where var, rowvar, and colvar are factors.

Table 3. ggplot2 facet functions

Syntax                       Results
facet_wrap(~var, ncol=n)     Separate plots for each level of var arranged into n columns
facet_wrap(~var, nrow=n)     Separate plots for each level of var arranged into n rows
facet_grid(rowvar~colvar)    Separate plots for each combination of rowvar and colvar, where rowvar represents rows and colvar represents columns
facet_grid(rowvar~.)         Separate plots for each level of rowvar, arranged as a single column
facet_grid(.~colvar)         Separate plots for each level of colvar, arranged as a single row

Here, facets are defined by the eight levels of the sector variable. Because each facet is smaller than a one-panel graph alone, we’ll omit size=3 from geom_point() and size=1.5 from geom_smooth(). This reduces the point and line sizes compared with the previous graphs and looks better in a faceted graph. The code below produces figure 9.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage,
                      color = sex, shape = sex, linetype = sex)) +
   geom_point(alpha = .7) +
   geom_smooth(method = "lm", se = FALSE) +
   scale_x_continuous(breaks = seq(0, 60, 10)) +
   scale_y_continuous(breaks = seq(0, 30, 5),
                      label = scales::dollar) +
   scale_color_manual(values = c("indianred3", "cornflowerblue")) +
   facet_wrap(~sector)
  

Figure 9. Scatterplot of worker experience vs. wages with custom x- and y-axes and custom color mappings for sex. Separate graphs (facets) are provided for each of eight job sectors.


It appears that the differences between men and women depend on the job sector under consideration. For example, there’s a strong positive relationship between experience and wages for male managers, but not for female managers. To a lesser extent, this is also true for sales workers. There appears to be no relationship between experience and wages for either male or female service workers. In either case, males make slightly more. Wages go up with experience for female clerical workers, but may go down for male clerical workers (the relationship may not be significant here). We’ve gained a great deal of insight into the relationship between wages and experience at this point.
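
The other syntax variants in table 3 work the same way. As a sketch (not one of the article’s figures), the code below uses facet_grid() to stack one panel per level of sex in a single column, and facet_wrap() with an explicit number of columns for the sectors. The object names are illustrative.

```r
library(ggplot2)
library(mosaicData)

# Base plot shared by both faceting variants
p_base <- ggplot(data = CPS85, mapping = aes(x = exper, y = wage)) +
  geom_point(alpha = .7, color = "cornflowerblue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# facet_grid(rowvar ~ .): one panel per level of sex, in a single column
p_grid <- p_base + facet_grid(sex ~ .)

# facet_wrap(~var, ncol = n): one panel per sector, wrapped into 4 columns
p_wrap <- p_base + facet_wrap(~sector, ncol = 4)

p_grid
p_wrap
```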

Labels

Graphs should be easy to interpret and informative labels are a key element in achieving this goal. The labs() function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added. Let’s modify each in the following code.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage,
                      color = sex, shape = sex, linetype = sex)) +
   geom_point(alpha = .7) +
   geom_smooth(method = "lm", se = FALSE) +
   scale_x_continuous(breaks = seq(0, 60, 10)) +
   scale_y_continuous(breaks = seq(0, 30, 5),
                      label = scales::dollar) +
   scale_color_manual(values = c("indianred3", "cornflowerblue")) +
   facet_wrap(~sector) +
   labs(title = "Relationship between wages and experience",
        subtitle = "Current Population Survey",
        caption = "source: http://mosaic-web.org/",
        x = "Years of Experience",
        y = "Hourly Wage",
        color = "Gender", shape = "Gender", linetype = "Gender")
  

The graph is provided in figure 10.


Figure 10. Scatterplot of worker experience vs. wages with separate graphs (facets) for each of eight job sectors and custom titles and labels.


Now a viewer doesn’t need to guess what the labels exper and wage mean, or where the data come from.

Themes

Finally, we can fine-tune the appearance of the graph using themes. Theme functions (which start with theme_) control background colors, fonts, gridlines, legend placement, and other non-data-related features of the graph. Let’s use a cleaner, more minimalistic theme this time. The code below produces the graph in figure 11.

 
 ggplot(data = CPS85,
        mapping = aes(x = exper, y = wage, color = sex)) +
   geom_point(alpha = .6) +
   geom_smooth(method = "lm", se = FALSE) +
   scale_x_continuous(breaks = seq(0, 60, 10)) +
   scale_y_continuous(breaks = seq(0, 30, 5),
                      label = scales::dollar) +
   scale_color_manual(values = c("indianred3", "cornflowerblue")) +
   facet_wrap(~sector) +
   labs(title = "Relationship between wages and experience",
        subtitle = "Current Population Survey",
        caption = "source: http://mosaic-web.org/",
        x = "Years of Experience",
        y = "Hourly Wage",
        color = "Gender") +
   theme_minimal()
  

Figure 11. Scatterplot of worker experience vs. wages with separate graphs (facets) for each of eight job sectors and custom titles and labels, and a cleaner theme.


This is our finished graph, ready for publication. These findings are tentative. They’re based on a limited sample size and don’t involve statistical testing to assess whether differences may be due to chance variation.
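
If you want to adjust individual theme elements or write the graph to a file, a sketch along the following lines may be a useful starting point. It uses a simplified version of the final plot; the legend position, file format, and dimensions are arbitrary choices, and p_final and out_file are illustrative names.

```r
library(ggplot2)
library(mosaicData)

p_final <- ggplot(data = CPS85,
                  mapping = aes(x = exper, y = wage, color = sex)) +
  geom_point(alpha = .6) +
  facet_wrap(~sector) +
  theme_minimal() +
  # theme() adjusts single elements; add it after the complete theme
  theme(legend.position = "bottom")

# Save to a temporary PDF file; width and height are in inches by default
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p_final, width = 8, height = 6)
file.exists(out_file)
```

Adding theme() after theme_minimal() matters, because a complete theme applied later would overwrite the individual adjustment.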

That’s all for this article.
