From Machine Learning with R, tidyverse, and mlr by Hefin Rhys
Take 37% off Machine Learning with R, tidyverse, and mlr. Just enter fccrhys into the discount code box at checkout at manning.com.
I’m excited to start teaching machine learning to you, but before we dive into that (in my book: Machine Learning with R, tidyverse, and mlr), I want to teach you some skills that are going to make your learning experience simpler and more effective. These skills are also going to improve your general data science and R programming skills.
Imagine I asked you to build me a car (a typical request between friends). You could go old-fashioned: you could purchase the metal, glass and all the components, hand cut all the pieces, hammer them into shape and rivet them together. The car may look beautiful and work perfectly, but it would take a long time and be hard for you to remember exactly what you did if you had to make another one.
Instead, you could use a modern approach using robotic arms in your factory. You could program them to cut and bend the pieces to their pre-defined shapes, and have them assemble the pieces for you. In this scenario, building a car would be much faster and simpler for you, and it would be easy for you to reproduce the same process in future.
Now imagine I made a more reasonable request of you and asked you to reorganize and plot a dataset, to make it ready for a machine learning pipeline. You could use base R functions for this, and they’d work fine. But the code would be long, wouldn’t be human-readable (in a month you’d struggle to remember what you did), and the plots would be cumbersome to produce.
Instead, you could take a more modern approach and use functions from the tidyverse family of packages. These functions make data manipulation simpler and more human-readable, and they allow you to produce attractive graphics with minimal typing.
What is the tidyverse and what’s tidy data?
The purpose of this article is to give you the skills to apply machine learning approaches to your data. Although it isn’t my intention to cover all aspects of data science (nor could I in an article), I do want to introduce you to the tidyverse. Before you can input your data into a machine learning algorithm, it needs to be in a format which the algorithm is happy to work with.
The tidyverse is an “opinionated collection of R packages designed for data science,” created for the purpose of making data science tasks in R simpler, more human-readable, and more reproducible. The packages are “opinionated” because they’re designed to make tasks the package authors consider good practice easy, and tasks they consider bad practice difficult. The name comes from the concept of tidy data, a data structure where:
- Each row represents a single observation.
- Each column represents a variable.
Take a look at the data in table 1. Imagine we take four runners, and put them on a new training regime. We want to know if the regime is improving their running times, and we record their best times before the new training starts (Month 0), and for three months after.
Table 1 An example of untidy data. This table contains the running times for four runners, taken immediately before starting a new training regime, and then for three months after.
Athlete | Month 0 | Month 1 | Month 2 | Month 3
This is an example of untidy data. Can you see why? Well, let’s go back to our rules. Does each row represent a single observation? Nope. In fact, we have four observations per row (one for each month). Does each column represent a variable? Nope. Only three variables are in this data: the athlete, the month, and the best time, and yet we have five columns!
How would the same data look in tidy format? Table 2 shows you.
Table 2 This table contains the same data as table 1, but in tidy format
Athlete | Month | Best
This time, we have the column Month that contains the month identifiers that were previously used as separate columns, and the Best column, which holds the best times for each athlete, for each month. Does each row represent a single observation? Yes! Does each column represent a variable? Yes! This data is in tidy format.
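As a rough sketch of how the reshaping from table 1 to table 2 might look in code, the example below uses the pivot_longer() function from the tidyr package (part of the tidyverse). The athlete labels and running times are invented purely for illustration; they are not the values from the tables above.

```r
library(tibble)
library(tidyr)

# Hypothetical data: athlete labels and times (in minutes) are made up
untidy <- tibble(
  Athlete   = c("A", "B", "C", "D"),
  `Month 0` = c(12.5, 14.2, 13.1, 15.0),
  `Month 1` = c(12.1, 13.9, 12.8, 14.6),
  `Month 2` = c(11.8, 13.5, 12.6, 14.1),
  `Month 3` = c(11.5, 13.2, 12.3, 13.9)
)

# Gather the four month columns into two columns:
# Month (the old column names) and Best (the values they held)
tidy <- pivot_longer(untidy,
                     cols = -Athlete,
                     names_to = "Month",
                     values_to = "Best")

dim(untidy)  # 4 rows, 5 columns
dim(tidy)    # 16 rows, 3 columns
```

Each athlete now occupies four rows (one per month) instead of one, which is what lets every row describe a single observation.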
Ensuring that your data is in tidy format is an important early step in any machine learning pipeline, and the tidyverse includes the package tidyr, which helps you achieve this. The other packages in the tidyverse work with tidyr and each other to help you:
- Organize and display your data in a sensible way (tibble)
- Manipulate and subset your data (dplyr)
- Plot your data (ggplot2)
All of the operations available to you in the tidyverse are achievable using base R code, but I strongly suggest you incorporate the tidyverse in your work as it helps you keep your code simpler, more human-readable, and reproducible.
The tidyverse contains more packages than the ones mentioned so far, including:
- readr, for reading data into R from external files
- purrr, for replacing loops with a functional programming approach
- forcats, for working with factors
- stringr, for working with strings
Loading the tidyverse
The packages of the tidyverse can all be installed and loaded together (recommended):
install.packages("tidyverse")
library(tidyverse)
or installed and loaded individually as needed:
install.packages(c("tibble", "dplyr", "ggplot2", "tidyr"))
library(tibble)
library(dplyr)
library(ggplot2)
library(tidyr)
What the tibble package is and what it does
If you’ve done any form of data science or analysis in R, you’ll surely have come across data frames as a structure for storing rectangular data.
Data frames work fine and, for a long time, were the only way to store rectangular data with columns of different types (in contrast to matrices, which can only hold data of a single type), but little has been done to improve the aspects of data frames that data scientists dislike.
Data is rectangular if each row has a number of elements equal to the number of columns and each column has a number of elements equal to the number of rows. Data isn’t always of this kind!
The tibble package introduces a new data structure, the tibble, to “keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating” (https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html). Let’s see what’s meant by this.
Creating tibbles
Creating tibbles with the tibble() function works the same as creating data frames:
Listing 1 Creating tibbles with tibble()
myTib <- tibble(x = 1:4,
                y = c("london", "beijing", "las vegas", "berlin"))
myTib

# A tibble: 4 x 2     ❶
      x y             ❷
  <int> <chr>         ❸
1     1 london
2     2 beijing
3     3 las vegas
4     4 berlin
❶ Tells us it’s a tibble of 4 rows and 2 columns
❷ Variable names
❸ Variable classes <int> = integer, <chr> = character
If you’re used to working with data frames, you’ll immediately notice two differences in how they’re printed:
- Tibbles tell you they’re a tibble and their dimensions when you print them
- Tibbles tell you the class of each variable
This second feature is particularly useful in avoiding errors due to incorrect variable types.
Tip When printing a tibble, <int> denotes an integer variable, <chr> denotes a character variable, <dbl> denotes a floating point number (decimal), and <lgl> denotes a logical variable.
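To see these abbreviations side by side, here is a minimal sketch that builds a tibble with one column of each type from the tip. The object and column names are made up for this example.

```r
library(tibble)

# One column for each type mentioned in the tip (names are illustrative)
typesTib <- tibble(
  count   = 1:3,                              # integer   -> <int>
  city    = c("london", "beijing", "berlin"), # character -> <chr>
  price   = c(0.5, 1.2, 1.8),                 # double    -> <dbl>
  inStock = c(TRUE, FALSE, TRUE)              # logical   -> <lgl>
)

typesTib                 # the abbreviations appear under each column name
sapply(typesTib, class)  # full class names; doubles report as "numeric"
```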
Converting existing data frames into tibbles
Just as you can coerce objects into data frames using the as.data.frame() function, you can coerce objects into tibbles using the as_tibble() function:
Listing 2 Converting data frames to tibbles
myDf <- data.frame(x = 1:4,
                   y = c("london", "beijing", "las vegas", "berlin"))

dfToTib <- as_tibble(myDf)
dfToTib

# A tibble: 4 x 2
      x y
  <int> <fct>
1     1 london
2     2 beijing
3     3 las vegas
4     4 berlin
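The same conversion works on data frames that ship with R. The sketch below converts the built-in mtcars data frame; because tibbles don’t keep row names, it uses the rownames argument of as_tibble() to store them in a column. The column name "model" is simply a choice made for this example.

```r
library(tibble)

# mtcars is a plain data frame that stores the car names as row names
class(mtcars)

# Move the row names into a regular column while converting
mtcarsTib <- as_tibble(mtcars, rownames = "model")

is_tibble(mtcarsTib)   # check that the result is a tibble
names(mtcarsTib)[1]    # the new column comes first
```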
Differences between data frames and tibbles
If you’re used to working with data frames, you’ll notice a few differences with tibbles. I’ve summarized the most notable differences between data frames and tibbles in this section.
Tibbles don’t convert your data types
A common frustration when creating data frames is that, in versions of R before 4.0.0, they convert string variables to factors by default. This can be annoying because it may not be the best way to handle the variables. To prevent this conversion in those versions, you must supply the stringsAsFactors = FALSE argument when creating a data frame.
In this article we’re working with data already built into R. Often, we need to read data into our R session from a .csv file. To load the data as a tibble, you use the read_csv() function. read_csv() comes from the readr package which is loaded when you call library(tidyverse) and it’s the tidyverse version of read.csv().
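As a small sketch of reading a file with read_csv(), the example below first writes a throwaway .csv file to a temporary location so that it is self-contained. The file contents and object names are made up for illustration.

```r
library(readr)
library(tibble)

# Write a small temporary .csv file so there is something to read
csvPath <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:4,
                     y = c("london", "beijing", "las vegas", "berlin")),
          csvPath, row.names = FALSE)

# read_csv() returns a tibble and reports the column types it guessed
citiesTib <- read_csv(csvPath)

is_tibble(citiesTib)  # the result is a tibble
class(citiesTib$y)    # strings are read in as character
```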
In contrast, tibbles don’t convert string variables to factors by default. This behavior is desirable because automatic conversion of data to certain types can be a frustrating source of bugs:
Listing 3 Tibbles don’t convert strings to factors
myDf <- data.frame(x = 1:4,
                   y = c("london", "beijing", "las vegas", "berlin"))

myDfNotFactor <- data.frame(x = 1:4,
                            y = c("london", "beijing", "las vegas", "berlin"),
                            stringsAsFactors = FALSE)

myTib <- tibble(x = 1:4,
                y = c("london", "beijing", "las vegas", "berlin"))

class(myDf$y)
[1] "factor"

class(myDfNotFactor$y)
[1] "character"

class(myTib$y)
[1] "character"
If you want a variable to be a factor in a tibble, you wrap the c() function inside factor():
myTib <- tibble(x = 1:4,
                y = factor(c("london", "beijing", "las vegas", "berlin")))
myTib
Concise output, regardless of data size
When you print a data frame, all the columns are printed to the console (by default), making it difficult to view early variables and cases. When you print a tibble, it only prints the first ten rows and the number of columns that fit on your screen (by default), making it easier to get a quick understanding of the data. Note that the names of variables that aren’t printed are listed at the bottom of the output. Run the code in listing 4 and contrast the output of the starwars tibble (which is included with dplyr and available when you call library(tidyverse)) with how it looks when converted into a data frame:
Listing 4 The starwars data as a tibble and data frame
data(starwars)

starwars

as.data.frame(starwars)
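The defaults can be overridden when you do want to see more (or less). As a brief sketch, the print() method for tibbles accepts n (number of rows) and width arguments, and glimpse() gives a transposed view with one line per variable:

```r
library(dplyr)  # provides the starwars tibble and glimpse()

# Show only the first three rows, with as many columns as fit on screen
print(starwars, n = 3)

# Show the first three rows of every column, wrapping across the console
print(starwars, n = 3, width = Inf)

# One line per variable: its name, type, and first few values
glimpse(starwars)
```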
Subsetting with [ always returns another tibble
When subsetting a data frame, the [ operator returns another data frame if you keep more than one column, or a vector if you keep only one. When subsetting a tibble, the [ operator always returns another tibble. If you wish to explicitly return a tibble column as a vector, use either the [[ or $ operators instead. This behavior is desirable because we should be explicit in whether we want a vector or rectangular data structure to avoid bugs:
Listing 5 Subsetting tibbles with [, [[, and $
myDf[, 1]
[1] 1 2 3 4

myTib[, 1]
# A tibble: 4 x 1
      x
  <int>
1     1
2     2
3     3
4     4

myTib[[1]]
[1] 1 2 3 4

myTib$x
[1] 1 2 3 4
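The difference can also be checked programmatically. The sketch below uses small example objects (named df and tib here, purely for illustration) and shows that a data frame only keeps its rectangular shape for a single column if you add drop = FALSE, whereas a tibble keeps it without being asked.

```r
library(tibble)

df  <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
tib <- tibble(x = 1:4, y = c("a", "b", "c", "d"))

# Data frame: a single column is silently simplified to a vector
is.data.frame(df[, 1])                # FALSE
is.data.frame(df[, 1, drop = FALSE])  # TRUE, but only because we asked

# Tibble: [ keeps the rectangular structure
is_tibble(tib[, 1])                   # TRUE

# Tibble: [[ and $ explicitly pull out a vector
is.vector(tib[[1]])                   # TRUE
is.vector(tib$x)                      # TRUE
```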
Variables are created sequentially
When building a tibble, variables are created sequentially, so later variables can reference ones defined earlier. This means we can create variables on-the-fly that refer to other variables in the same function call:
Listing 6 Variables are created sequentially
sequentialTib <- tibble(nItems = c(12, 45, 107),
                        cost = c(0.5, 1.2, 1.8),
                        totalWorth = nItems * cost)
sequentialTib

# A tibble: 3 x 3
  nItems  cost totalWorth
   <dbl> <dbl>      <dbl>
1     12   0.5          6
2     45   1.2         54
3    107   1.8        193
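For contrast, here is a sketch of what happens when the same call is attempted with data.frame(). It assumes no objects called nItems or cost already exist in your workspace; in that case the call fails, because the columns aren’t available while the arguments are being evaluated. The object names dfAttempt and inventory are made up for this example.

```r
# data.frame() evaluates every argument before any column exists,
# so nItems and cost can't be found (assuming no such objects are
# already defined in the workspace)
dfAttempt <- tryCatch(
  data.frame(nItems = c(12, 45, 107),
             cost = c(0.5, 1.2, 1.8),
             totalWorth = nItems * cost),
  error = function(e) conditionMessage(e)
)
dfAttempt  # the error message, captured as a string

# The base R route needs a second step to add the derived column
inventory <- data.frame(nItems = c(12, 45, 107),
                        cost = c(0.5, 1.2, 1.8))
inventory$totalWorth <- inventory$nItems * inventory$cost
inventory
```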
That’s all for this article. If you want to learn more about the book, check it out on liveBook here and see this slide deck.