Description: https://images.manning.com/360/480/resize/book/e/542293a-5ae6-432b-a2c0-c388b32affd6/Kaminski-MEAP-HI.png

An excerpt from Julia for Data Analysis by Bogumil Kaminski

This article dives into working with data in dataframes with Julia.

Read it if you’re a data scientist or anyone who works with lots of data, and if you’re interested in the Julia language.


Take 25% off Julia for Data Analysis by entering fcckaminski into the discount code box at checkout at manning.com.


Fetching, unpacking, and inspecting the data

To work with the database of puzzles available on Lichess we first need to download it from the web. Next, we will uncompress it so that later it will be possible to read it into a DataFrame. I will show you how to unpack data stored in the bzip2 archive. However, the same approach can be used to uncompress archives created in other formats. Data compression is often used as it reduces storage size or transfer time so you can expect that knowledge how to handle such data is useful in practice.

REPRODUCIBILITY NOTE In the GitHub repository containing the sources of codes for this book I have included the puzzles.csv.bz2 file that we create in this section to ensure reproducibility of the results presented in this article. The Lichess puzzles database is constantly updated, so if you will choose to download its latest version you can expect to get slightly different results. 

Downloading the file from the web

Since the downloaded file is large I add a check if it already exists and then avoid fetching it again:

 
 julia> import Downloads
  
 julia> if isfile("puzzles.csv.bz2") #A
            @info "file already present" #B
        else
            @info "fetching file" #C
            Downloads.download("https://database.lichess.org/" * #C
                               "lichess_db_puzzle.csv.bz2", #C
                               "puzzles.csv.bz2") #C
        end
 [ Info: file already present
  

#A check if the file is already present

#B if yes just print the information confirming it

#C if not inform the user that the data needs to be fetched from the web

I use the @info macro to print an appropriate status message. In the above printout I show the result for the case when the puzzles.csv.bz2 was already present in the working directory as in such a case the isfile("puzzles.csv.bz2") check produces true.

Working with bzip2 archives

The database is compressed using the bzip2 (https://www.sourceware.org/bzip2/) algorithm which is signaled by the bz2 file extension. We will use the CodecBzip2.jl package to decompress it. We first read the contents of the file as a vector of UInt8 values (that is bytes), and next decompress it also to a vector of bytes using the transcode function:

 
 julia> using CodecBzip2
  
 julia> compressed = read("puzzles.csv.bz2") #A
 94032447-element Vector{UInt8}:
  0x42
  0x5a
  0x68
     ⋮
  0x49
  0x5f
  0x30
  
 julia> plain = transcode(Bzip2Decompressor, compressed) #B
 366020640-element Vector{UInt8}:
  0x30
  0x30
  0x30
     ⋮
  0x32
  0x30
  0x0a
  

#A read compressed data into a vector of bytes

#B decompress the data using the Bzip2Decompressor codec

Note that the compressed data has 94,032,447 bytes and after decompression it becomes 366,020,640 bytes. Thus, the compression ratio for this data set is around 4:

 
 julia> length(plain) / length(compressed)
 3.892492981704496
  

Most likely we might want to get back to our uncompressed data several times. Let us then write it to a puzzles.csv file:

 
 julia> open("puzzles.csv", "w") do io
            println(io, "PuzzleId,FEN,Moves,Rating,RatingDeviation," *
                        "Popularity,NbPlays,Themes,GameUrl") #A
            write(io, plain) #B
        end
 366020640
  

#A write to io the text representation of the second passed argument followed by a newline

#B write to io the binary representation of the second passed argument

Here we used the pattern involving the open function with the doend block we have already seen in chapter 6. What is new is the use of the write function. It is used to write binary representation of data to a file. In our case, since plain is Vector{UInt8} we just write its raw contents to the file. Note that before writing the uncompressed data stored in the plain vector I have written a string to this file using the println function. This was needed because, as soon we will learn, the original CSV data did not have a header with column names. I have used column names given on the Lichess website (https://database.lichess.org/#puzzles).

Inspecting the CSV file

Let us quickly inspect the contents of the puzzles.csv file:

 
 julia> readlines("puzzles.csv")
  

Running this command gives the following output in a terminal:



Indeed, the file looks as a properly formatted CSV file. Such a file format is a popular way to store tabular data. The CSV file format is specified as follows:

  • the first line of the file contains column names separated by commas (,);
  • for the following lines, each line contains information about a single observation (record) of our data; in a single line commas separate cells that refer to consecutive columns of our table; the number of columns in each row must be equal to the number of column names defined in the first row of data.

Loading the data to a data frame

Now that we have uncompressed the data let us load it into a data frame. Our Lichess data is stored in CSV format, and I have chosen this example intentionally. The reason was that it is one of the most popular human readable data formats used in practice. It can be easily read and written by spreadsheet editors. Therefore, it is worth to know how to work with CSV files in Julia.

Reading a CSV file into a data frame

The DataFrame type defined in the DataFrames.jl library is one of the more popular options that you can use to store tabular data in-memory in Julia. To read-in the puzzles.csv file from disk to a DataFrame use the CSV.read function from the CSV.jl package:

 
 julia> using CSV
  
 julia> using DataFrames
  
 julia> puzzles = CSV.read("puzzles.csv", DataFrame);
  

In the last expression I have used the semicolon (;) to suppress printing the data frame contents to screen.

The CSV.read function can read data not only from a file whose name is passed as a string, but it can also be directly passed a source that provides a sequence of bytes that contain the data that should be read in. In our case we have such source, as it is a binary vector bound to the plain variable. Therefore, alternatively we could have created our data frame by writing:

 
 julia> puzzles2 = CSV.read(plain, DataFrame;
                            header=["PuzzleId", "FEN", "Moves",
                                    "Rating","RatingDeviation",
                                    "Popularity", "NbPlays",
                                    "Themes","GameUrl"]); #A
 julia> puzzles == puzzles2 #B
 true
  

#A read the data from the vector of bytes while passing column names using the header keyword argument

#B check that puzzles and puzzles2 data frames are identical

Note that in this case we have passed the header keyword argument to the CSV.read function as our original data did not have column names. Next we compared two data frames using the == operator to make sure they are identical.

Further we will not need the values bound to compressed and plain variables. Therefore, to allow Julia to free memory allocated by these objects we bind nothing to both variables:

 
 julia> compressed = nothing
  
 julia> plain = nothing
  

Inspecting the contents of a data frame

Let us have a peek at the puzzles data frame:

Listing 1. Printing a sample data frame to the screen

 
 julia> puzzles
  

The output of listing 8.1 is cropped as you can see in the following image:



Cropping is indicated by triple dots and the message in the bottom right of the printout, where we learn that there are 7 more columns and 2,123,983 rows that were not fully printed. The exact output you get when you run this command on your computer depends on the window size where it is displayed.

When the puzzles data frame is printed then in the header you get information about shown column names and their element types. Each row in our data frame is a single puzzle description. Observe that the "PuzzleId" column uses 5 characters to encode puzzle identifier. The CSV.read function automatically detected this fact and read this column in by storing the strings using the String7 type. On the other hand, "FEN" and "Moves" columns are wider and therefore the String type was used.

After reading in the data, it is a good practice to check if the process produced expected results in all columns. To get a quick view into a summary statistics of some data frame use the describe function.

Listing 2. Getting summary statistics of the data frame columns

 
 julia> describe(puzzles)
  

You can see the result in the following picture (I wanted to show you the whole default output of the describe function which is wide so the text on image is small; therefore, I recommend you run this command on your computer to get a more legible view):



The describe function returns a new data frame in which each row contains information about a single column of the original data frame. By default, describe produces the following statistics for each source column:

  • "variable": name stored as Symbol;
  • "mean": average of values if the column contains numeric data;
  • "min": minimum value if the column contains data that can have defined order;
  • "median": median of values if the column contains numeric data;
  • "max": maximum value if the column contains data that can have defined order;
  • "nmissing": number of missing values;
  • "eltype": type of values stored.

The describe function allows you to additionally specify the statistics you want to compute (the ones listed above are the default) and select for which columns the summary statistics should be computed. Please refer to the documentation (https://dataframes.juliadata.org/stable/lib/functions/#DataAPI.describe) if you would like to learn the details.

Given the information presented in the summary statistics in listing 2 we are ready to give an interpretation of the columns stored in the puzzles data frame:

  • "PuzzleId": a unique identifier of the puzzle;
  • "FEN": an encoding of a starting position of the puzzle;
  • "Moves": moves that are a solution of the puzzle;
  • "Rating": difficulty of the puzzle;
  • "RatingDeviation": accuracy of assessment of difficulty of the puzzle;
  • "Popularity": how much the puzzle is liked by the users (the higher the better);
  • "NbPlays": number of times a given puzzle was played;
  • "Themes": description of chess themes that are featured by the puzzle;
  • "GameUrl": URL to the source game from which the puzzle was taken.

Before we move forward let me mention three functions that are very commonly used when working with data frames: the ncol, nrow, and names functions.

The ncol function returns the number of columns in a data frame:

 
 julia> ncol(puzzles)
 9
  

The nrow function returns the number of rows in a data frame:

 
 julia> nrow(puzzles)
 2132989
  

Finally, the names function returns a vector of column names in our data frame (this function has more features, but they are outside of the scope of this article):

 
 julia> names(puzzles)
 9-element Vector{String}:
  "PuzzleId"
  "FEN"
  "Moves"
  "Rating"
  "RatingDeviation"
  "Popularity"
  "NbPlays"
  "Themes"
  "GameUrl"
  

Saving a data frame to a CSV file

Before we wrap up this section let me show you how you can save a data frame back to a CSV file. You can do it using the CSV.write function where the first argument is the target file name and the second argument is the table you want to save:

 
 julia> CSV.write("puzzles2.csv", puzzles)
 "puzzles2.csv"
  

In our code we saved the puzzles data frame to the puzzles2.csv file.

Now it would be interesting to check if the original puzzles.csv file and puzzles2.csv files are identical. To perform this test we will use the read function, which when passed a file as a single argument, returns a Vector{UInt8} that contains bytes read from the file. Here is an example:

 
 julia> read("puzzles2.csv")
 386223179-element Vector{UInt8}:
  0x50
  0x75
  0x7a
     ⋮
  0x32
  0x30
  0x0a
  

Therefore, we can check that files puzzles.csv and puzzles2.csv are identical, by comparing the result of the read function applied to them:

 
 julia> read("puzzles2.csv") == read("puzzles.csv")
 true
  

Indeed, both files contain identical data.

Getting a column out of a data frame

To be able to perform our analysis, we need to learn how to get data out of a data frame. The most common operation of this kind is extracting a single column. DataFrames.jl provides several options how this can be done. Let us investigate them one by one.

To keep the focus on a task of analysis of the Lichess puzzles data we will specifically want to create a histograms of "Rating", "RatingDeviation", "Popularity", and "NbPlays" columns from the puzzles data frame as they will be used in our later analysis.

Data frame’s storage model

Internally, a DataFrame object stores data as a collection of vectors. Each vector represents one column of a data frame and is assigned a name and a number. Let us visualize it in table 1.

Table 1. Structure of the puzzles data frame

Column #

Column name

Column vector

1

"PuzzleId"

["00008", "0000D", "0009B", "000aY", ...]

2

"FEN"

["r6k/pp2r2p/ ... /7K b - - 0 24", ...]

3

"Moves"

["f2g3 e6e7 b2b1 b3c1 b1c1 h6c1", ...]

4

"Rating"

[1765, 1525, 1102, 1320, 1560, 1039, ...]

5

"RatingDeviation"

[74, 74, 75, 74, 76, 80, 75, ...]

6

"Popularity"

[93, 97, 85, 92, 88, 85, 80, ...]

7

"NbPlays"

[493, 9211, 503, 395, 441, 54, ...]

8

"Themes"

["crushing ... middlegame", ...]

9

"GameUrl"

["https://lichess.org/.../black#48", ...]

For example, internally column number 4 has name "Rating" and stores a vector of integers that represent puzzle’s difficulties: [1765, 1525, 1102, 1320, 1560, ...].

The storage layout of a data frame was chosen to ensure that operations that are performed on columns of some data frame are very fast. We will discuss multiple such operations in this book. Let us start with the simplest one: extracting a column from a data frame.

Treating a data frame column as a property

You can access fields of NamedTuple or composite type using a dot (.) followed by a field name. The same syntax allows accessing columns of a data frame.

If you consider the puzzles data frame and want to extract out the "Rating" column from it just write:

 
 julia> puzzles.Rating
 2132989-element Vector{Int64}:
  1765
  1525
  1102
     ⋮
   980
  1783
  2481
  

The dot (.) allows the user to access the fields of a struct. How is it then possible that the DataFrame type allows the user to access its columns using this syntax? The reason is that Julia makes a distinction between fields of a struct object and its properties. When you use the dot (.) syntax you get access to properties of an object. By default, properties of the object are the same as its fields, but it is possible to override this behavior. This is exactly what is done for the DataFrame type. Instead of exposing its fields it allows users to access its columns using the dot (.) syntax since this is much more useful in practice.

Note that field names are part of a type definition, so every value having this type has the same fields. Conversely, if some type overrides the definition of properties, then they can be different across values having the same type. For instance, all values having the DataFrame type have the same fields, but their properties depend on the column names that a given data frame stores.

It is possible to get the list of fields of DataFrame type using the fieldnames function. If you called fieldnames(DataFrame) you would get the following tuple (:columns, :colindex). They respectively store the vectors constituting the columns of a DataFrame and a mapping between column names and their numbers. If you wanted to extract out the fields from a value having the DataFrame type you could use the getfield function.

In figure 1 I show how the fields and properties of two example data frames are related.

However, although it is technically possible, you should never directly extract out the fields from a data frame. The internal layout of the DataFrame type is considered private and might change in the future.


Figure 1. Comparison between field and property names for two different DataFrame values. Both values have the same property names as they are a part of a definition of the DataFrame type. Conversely, since both data frames have different columns their property names are different as for the DataFrame type properties are defined to correspond to column names of a given instance. In case of data frame objects, you should not directly access their fields as they are considered private.


Let us go back to the topic of this section and check how fast the operation of getting a column from a data frame is using the @benchmark macro:

 
 julia> using BenchmarkTools
  
 julia>  @btime $puzzles.Rating;
   7.600 ns (0 allocations: 0 bytes)
  

The operation is very fast; it took only a few nanoseconds. The reason is that accessing a data frames’s column this way is performed without copying the data. By writing puzzles.Rating you get the same data that is referenced to by the puzzles object. The only operations that Julia needs to perform to get the column from a data frame is to retrieve the information from the :colindex private field that "Rating" has column number 4 and then extract it out from the :columns private field.

This behavior has a clear performance benefit. However, you might ask what to do if you want to get a copy of the vector. This is not an artificial question. In practice you might expect that you later would want to modify it and not want to change the source vector.

An established way in Julia to copy some object is to call the copy function on it so by writing copy(puzzles.Rating) you get a copy of the vector stored in the puzzles data frame. However, when you compare the puzzles.Rating and copy(puzzles.Rating) you will learn that they are equal:

 
 julia> puzzles.Rating == copy(puzzles.Rating)
 true
  

This shows us that using == to test two vectors for equality compares their contents and not their memory location. Is it then possible to compare vectors in a way that would check if they were the same objects (in a sense that no Julia program could distinguish between them)? Indeed, there is. You can achieve this using the === comparison:

 
 julia> puzzles.Rating === copy(puzzles.Rating)
 false
  

And we see that these two objects are not the same (although they have the same contents).

If we compared puzzles.Rating against puzzles.Rating using === we get true as this time it is the same object indeed:

 
 julia> puzzles.Rating === puzzles.Rating
 true
  

On the other hand, two copies are different, as expected:

 
 julia> copy(puzzles.Rating) === copy(puzzles.Rating)
 false
  

You can also use a string literal after the dot (.) when getting the column from a data frame:

 
 julia> puzzles."Rating"
 2132989-element Vector{Int64}:
  1765
  1525
  1102
     ⋮
   980
  1783
  2481
  

The effect of this operation is the same as writing puzzles.Rating. You might ask then why is puzzles."Rating" useful?  This syntax makes it easy to handle the cases in which the data frame’s column name contains special characters, such as, for example, spaces. Then the double quotes (") make it unambiguous where the column names start and end. A small downside of writing puzzles."Rating" instead of puzzles.Rating is that it is a bit slower, as internally Julia needs to convert the string to a Symbol before getting the data from a data frame. However, still this operation is fast and is of nanosecond order.

Getting a column using data frame indexing

Using the property access, like puzzles.Rating, to get a column of a data frame is easy to type, but it has one drawback. What if we had the column name stored in a variable like this:

 
 julia> col = "Rating"
 "Rating"
  

How can you get a column of the puzzles data frame that is referenced to by the col variable? Another question is how to get a column by its number, not by its name from a data frame? Both these questions are answered by using the indexing syntax.

The general form of indexing into a data frame is as follows:

 
 data_frame_name[selected_rows, selected_columns]
  

As you can see it is very similar to matrices. There are various options for the accepted values of selected_rows and selected_columns, but in this article we focus on what you should use to get a single column from a data frame.

To get some column from a data frame with copying use colon (:) as row selector and column name as string, or Symbol, or number as column selector. Here are four equivalent ways of getting the "Rating" column from the puzzles data frame with copying:

 
 puzzles[:, "Rating"]
 puzzles[:, :Rating]
 puzzles[:, 4]
 puzzles[:, col]
  

It is important to highlight that when referring to the column of a data frame, you can always use either a string or a Symbol. For user convenience data frame will accept both and treat in the same way. Please choose the style that is more convenient for you. In this book we will usually refer to columns of a data frame using strings.

Note that in the last selector puzzles[:, col] we used the col variable that is bound to the "Rating" string. Allowing for such selection is exactly the benefit of using indexing over using property access.

Finally, you might ask how I have established that "Rating" is the fourth column in our data frame. It is easy to check using the columnindex function:

 
 julia> columnindex(puzzles, "Rating")
 4
  

If some column name is not found in the data frame then the columnindex function returns 0, for example:

 
 julia> columnindex(puzzles, "Some fancy column name")
 0
  

You can also perform a test if some data frame contains some column name using the hasproperty function:

 
 julia> hasproperty(puzzles, "Rating")
 true
  
 julia> hasproperty(puzzles, "Some fancy column name")
 false
  

Note that both in columnindex and hasproperty we could have used a Symbol instead of a string to pass a column name if we preferred so.

Getting a column from a data frame with copying is more expensive than non-copying operation:

 
 julia>  @btime $puzzles[:, :Rating];
   2.170 ms (2 allocations: 16.27 MiB)
  

In our case the time has grown from nanoseconds that we saw for puzzles.Rating selector to milliseconds for puzzles[:, :Rating]. Also, much more memory was used.

To get some column from a data frame without copying use exclamation mark (!) as row selector and column name as string, or Symbol, or number as column selector. Here are four equivalent ways of getting the "Rating" column from the puzzles data frame without copying:

 
 puzzles[!, "Rating"]
 puzzles[!, :Rating]
 puzzles[!, 4]
 puzzles[!, col]
  

Note that this means that, for example, writing puzzles[!, "Rating"] is equivalent to writing puzzles.Rating.

Visualizing data stored in columns of a data frame

Now that we have learned how to get a column from a data frame we are ready to perform the desired plots. The code below uses the histogram function from the Plots.jl package to produce four histograms of columns: "Rating", "RatingDeviation", "Popularity", and "NbPlays".

 
 julia> using Plots
  
 julia> plot(histogram(puzzles.Rating; label="Rating"),
             histogram(puzzles.RatingDeviation; label="RatingDeviation"),
             histogram(puzzles.Popularity; label="Popularity"),
             histogram(puzzles.NbPlays; label="NbPlays"))
  

You can see the result in figure 2. We can see that all these variables are significantly skewed.


Figure 2. Histograms of columns "Rating", "RatingDeviation", "Popularity", and "NbPlays" from the puzzles data frame. All the analyzed variables exhibit skewness.


However, before moving to the next chapter let us write the code generating the plot in an alternative way as an exercise:

 
 julia> plot([histogram(puzzles[!, col]; label=col) for
              col in ["Rating", "RatingDeviation",
                      "Popularity", "NbPlays"]]...)
  

The three dots (...) at the end of the code is splatting that we learned in chapter 4. The reason I have prepared this code is that it presents you how you can leverage the fact that you can pass a variable instead of explicit column name when indexing a data frame. Note that in this case I used a non-copying access to the data (by applying ! as row selector) because I am sure that I will not modify or store the extracted column (the values are only used to produce a plot).

That’s all for now. Thanks for reading.