Description: https://images.manning.com/360/480/resize/book/e/542293a-5ae6-432b-a2c0-c388b32affd6/Kaminski-MEAP-HI.png

An excerpt from Julia for Data Analysis by Bogumil Kaminski

Many languages designed for doing data science provide ways to perform vectorized operations, which is also often called broadcasting. In Julia, broadcasting is also supported. In this article you will see how to use it.

Read it if you’re a data scientist or anyone who works with lots of data, and if you’re interested in the Julia language.


Take 25% off Julia for Data Analysis by entering fcckaminski into the discount code box at checkout at manning.com.


We will discuss how broadcasting works by using Anscombe’s quartet data. However, let us start with explaining of broadcasting on some toy examples.

Syntax and meaning of broadcasting in Julia

An important design rule of the Julia language is that it uses the approach that definitions of functions follow the rules of mathematics. The multiplication operator * uses matrix multiplication rules. Therefore, the following code follows matrix multiplication rules:

 
 julia> x = [1 2 3]
 1×3 Matrix{Int64}:
  1  2  3
  
 julia> y = [1, 2, 3]
 3-element Vector{Int64}:
  1
  2
  3
  
 julia> x * y
 1-element Vector{Int64}:
  14
  

The operation works as we multiply x which is bound to a 1×3 matrix and a 3-element vector y and in Julia vectors are always interpreted as columnar.

You might ask then how one should multiply two vectors elementwise, an operation known as Hadamard product in mathematics. Clearly this is not possible just with the * operator as it does standard matrix multiplication:

 
 julia> a = [1, 2, 3]
 3-element Vector{Int64}:
  1
  2
  3
  
 julia> b = [4, 5, 6]
 3-element Vector{Int64}:
  4
  5
  6
  
 julia> a * b
 ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})
  

You get an error as multiplication of a vector by a vector is not a valid mathematical operation. What we need to do is to broadcast the multiplication. In Julia adding broadcasting to an operator is easy. You just prefix it with a dot (.) like this:

 
 julia> a .* b
 3-element Vector{Int64}:
   4
  10
  18
  

When a broadcasted operator like .* is used Julia iterates elements of passed collections, in our case vectors a and b, and applies the operator after the dot, in our case *, elementwise. Therefore, in this case broadcasting result is the same as produced by the following operations:

 
 julia> map(*, a, b)
 3-element Vector{Int64}:
   4
  10
  18
  
 julia> [a[i] * b[i] for i in eachindex(a, b)]
 3-element Vector{Int64}:
   4
  10
  18
  

In the map example we are passing two collections (instead of only one as we did before when we explained how map works) and the passed function, * in this case, is applied iteratively elementwise to them until one of the collections gets exhausted.

In the comprehension example it is worth to comment on the eachindex(a, b) expression which produces:

 
 julia> eachindex(a, b)
 Base.OneTo(3)
  

The eachindex function produces indices that could be used to index into both a and b arguments passed to it. In this case these are just integers from 1 to 3. This means that you can index both a and b vectors with these values; for example, the following indexing expressions are valid: a[1], b[2], a[3], but a[0] or b[4] would not be valid as they are not in the range specified by Base.OneTo(3).

If sizes of a and b would not match, we get an error:

 
 julia> eachindex([1, 2, 3], [4, 5])
 ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)")
  

This is a very important difference from the map function which does not use the eachindex function internally, but instead iterates collections until either of them is exhausted as we have explained above:

 
 julia> map(*, [1, 2, 3], [4, 5])
 2-element Vector{Int64}:
   4
  10
  

Practical considerations of using map

If you pass to the map function multiple collections you should always make sure to check beforehand if they have the same length, as most of the time using collections of unequal length with the map function is a bug.

Using broadcasting just like the eachindex function checks if the dimensions of the passed objects match:

 
 julia> [1, 2, 3] .* [4, 5]
 ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 2")
  

Expansion of length-1 dimensions in broadcasting

There is one exception to the rule that dimensions of all collections taking part in broadcasting must match. This additional rule states that single element dimensions get expanded to match the size of the other collection:

 
 julia> [1, 2, 3] .^ [2]
 3-element Vector{Int64}:
   1
   4
   9
  

You might ask why single element dimension gets expanded? The reason is practical: in most cases when your collection has a single element in some dimension you want it to get expanded. Here is a first example:

 
 julia> [1, 2, 3] .^ 2
 3-element Vector{Int64}:
  1
  4
  9
  

Here we are calculating a square of elements of a vector. Since 2 is a scalar, it is interpreted as having size 1 in each dimension. Most people agree that in such a case dimension expansion should happen. You will see the same behavior both in Python and R.

Now let us consider a second example:

 
 julia> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10]
 10×10 Matrix{Int64}:
   1   2   3   4   5   6   7   8   9   10
   2   4   6   8  10  12  14  16  18   20
   3   6   9  12  15  18  21  24  27   30
   4   8  12  16  20  24  28  32  36   40
   5  10  15  20  25  30  35  40  45   50
   6  12  18  24  30  36  42  48  54   60
   7  14  21  28  35  42  49  56  63   70
   8  16  24  32  40  48  56  64  72   80
   9  18  27  36  45  54  63  72  81   90
  10  20  30  40  50  60  70  80  90  100
  

Here we have created a multiplication table. The reason why the specified operation worked is that we had a 10-element vector having 1 column and 10 rows, and a 10-element matrix having 1 row and 10 columns. In this case dimension expansion happened both for the left- and the right-hand side of the operation.

The technique we have shown above is used very often in practice to get a Cartesian product of all inputs. For instance, in part 2 of the book you will learn that in DataFrames.jl when you write "x" => sum you ask the package to apply the sum function to the column "x" of the data frame. A common task is that we want to apply several functions to several columns of a data frame. Using broadcasting this can be written concisely as:

 
 julia> ["x", "y", "z"] .=> [sum minimum maximum]
 3×3 Matrix{Pair{String, _A} where _A}:
  "x"=>sum  "x"=>minimum  "x"=>maximum
  "y"=>sum  "y"=>minimum  "y"=>maximum
  "z"=>sum  "z"=>minimum  "z"=>maximum
  

Such an expression would ask for computation of sum, minimum, and maximum for columns "x", "y", and "z".

You now know that you can add a dot (.) before any operator to broadcast it. How about functions that are not operators? Here you also use a dot (.), but this time you suffix it after the function name. Here is an example:

 
 julia> abs.([1, -2, 3, -4])
 4-element Vector{Int64}:
  1
  2
  3
  4
  

Let me stress that just applying the abs function to a vector results in an error:

 
 julia> abs([1, 2, 3])
 ERROR: MethodError: no method matching abs(::Vector{Int64})
  

The reason is the same as before: absolute value is mathematically defined for numbers but is not defined for vectors. Of course, you can conveniently also apply broadcasting to a function taking multiple arguments. For example, the string function concatenates its arguments into a single string:

 
 julia> string(1, 2, 3)
 "123"
  

If we use broadcasting on this function, we get for instance the following result:

 
 julia> string.("x", 1:10)
 10-element Vector{String}:
  "x1"
  "x2"
  "x3"
  "x4"
  "x5"
  "x6"
  "x7"
  "x8"
  "x9"
  "x10"
  

Here we expanded the dimension of the scalar "x" to match the length of the 1:10 range. Such an operation is quite common when one wants to automatically generate names for some objects, for example, columns of some data frame or file names in some folder.

It is important to highlight here that prefixing a dot before the operator or suffixing it to a function name is a fully general solution. It is not a hardcoded functionality of some specific predefined operations. You can use broadcasting with any custom function. For instance:

 
 julia> f(i::Int) = string("got integer ", i)
 f (generic function with 2 methods)
  
 julia> f(s::String) = string("got string ", s)
 f (generic function with 2 methods)
  
 julia> f.([1, "1"])
 2-element Vector{String}:
  "got integer 1"
  "got string 1"
  

Here we have defined two methods for the f function. As you can see by writing f., we have automatically broadcasted it without having to define anything extra.

Protection of collections from being broadcasted over

Before we go back to our Anscombe’s quartet data let me comment on one very common case. What should one do if one does not want to broadcast some collection but to force its reuse along all dimensions as if it were a scalar? To understand this issue let me introduce the in function first.

The in function

The in function is used to check if some value is contained in some collection. For example,

 
 julia> in(1, [1, 2, 3])
 true
  
 julia> in(4, [1, 2, 3])
 false
  

As a convenience, in also supports infix notation

 
 julia> 1 in [1, 2, 3]
 true
  
 julia> 4 in [1, 2, 3]
 false
  

Observe that you already know this infix notation as it is used when defining iterations in for loops, see section 2.2 for an explanation how such loops work.

Now imagine I have a long vector of values I want to check if they are contained in some vector. When you try the test without broadcasting it does not work like you probably expected:

 
 julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4])
 false
  

The problem is that the vector [1, 3, 5, 7, 9] is not an element of vector [1, 2, 3, 4] so we get false. For a reference let us test the scenario when we put the [1, 3, 5, 7, 9] vector into the collection in which we look for it:

 
 julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4, [1, 3, 5, 7, 9]])
 true
  

As expected, this time the in test returned true.

Going back to the original test note that broadcasting does not seem to work either:

 
 julia> in.([1, 3, 5, 7, 9], [1, 2, 3, 4])
 ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 4")
  

How should we resolve this issue? The solution is to wrap the vector that we want to be expanded with Ref. In this way we will protect this object from being iterated over. Instead, it will be unwrapped from Ref and treated by broadcasting as if it were a scalar:

 
 julia> in.([1, 3, 5, 7, 9], Ref([1, 2, 3, 4]))
 5-element BitVector:
  1
  1
  0
  0
  0
  

This time we got the expected result.

What is Ref in Julia?

In Julia when you write r = Ref(x) you create a 0-dimensional container storing the x value as its only element. You can retrieve the x object from the Ref value r by writing r[] (notice that we do not pass any indices in the indexing syntax, as the r object is 0-dimensional). The type is named Ref as you can think of it that r is a reference to x.

Since Ref objects are 0-dimensional and store exactly one element they have length 1 in every dimension. In consequence if you use the r object in broadcasting the x value stored in it is used in all required dimensions following expansion rules.

In the output note that Boolean true is printed as 1 and Boolean false is printed as 0. This choice of display was made to allow for more convenient visual inspection of large matrices containing Boolean values. To see why this is useful consider if we wanted to check which entries in a multiplication table are odd using the isodd function:

 
 julia> isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10])
 10×10 BitMatrix:
  1  0  1  0  1  0  1  0  1  0
  0  0  0  0  0  0  0  0  0  0
  1  0  1  0  1  0  1  0  1  0
  0  0  0  0  0  0  0  0  0  0
  1  0  1  0  1  0  1  0  1  0
  0  0  0  0  0  0  0  0  0  0
  1  0  1  0  1  0  1  0  1  0
  0  0  0  0  0  0  0  0  0  0
  1  0  1  0  1  0  1  0  1  0
  0  0  0  0  0  0  0  0  0  0
  

In this example I have additionally shown you that broadcasting operations can be chained together in a single expression. In this case we broadcasted both the multiplication * and the isodd function.

For a reference let me show you how this matrix would be displayed if we changed its element type to Any:

 
 julia> Matrix{Any}(isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .*
                           [1 2 3 4 5 6 7 8 9 10]))
 10×10 Matrix{Any}:
   true  false   true  false   true  false   true  false   true  false
  false  false  false  false  false  false  false  false  false  false
   true  false   true  false   true  false   true  false   true  false
  false  false  false  false  false  false  false  false  false  false
   true  false   true  false   true  false   true  false   true  false
  false  false  false  false  false  false  false  false  false  false
   true  false   true  false   true  false   true  false   true  false
  false  false  false  false  false  false  false  false  false  false
   true  false   true  false   true  false   true  false   true  false
  false  false  false  false  false  false  false  false  false  false
  

This time true and false is printed to avoid potential confusion with integer 1 and 0 that could be potentially stored in this matrix. In my opinion, however, analyzing such a printout is less convenient than before.

To practice what we have learned try the following exercise that is a quite common task when processing data.

Exercise 1 The parse function can be used to convert a string into a number. For instance, if you want to parse a string as an integer write parse(Int, "10") to get an integer 10. Assume you are given a vector of strings ["1", "2", "3"]. Your task is to create a vector of integers by parsing the strings contained in the given vector. 

Analyzing Anscombe’s quartet data using broadcasting

Now we are ready to load up our Anscombe’s quartet data. Let us initialize the aq variable first:

 
 julia> aq = [10.0   8.04  10.0  9.14  10.0   7.46   8.0   6.58
               8.0   6.95   8.0  8.14   8.0   6.77   8.0   5.76
              13.0   7.58  13.0  8.74  13.0  12.74   8.0   7.71
               9.0   8.81   9.0  8.77   9.0   7.11   8.0   8.84
              11.0   8.33  11.0  9.26  11.0   7.81   8.0   8.47
              14.0   9.96  14.0  8.1   14.0   8.84   8.0   7.04
               6.0   7.24   6.0  6.13   6.0   6.08   8.0   5.25
               4.0   4.26   4.0  3.1    4.0   5.39  19.0  12.50
              12.0  10.84  12.0  9.13  12.0   8.15   8.0   5.56
               7.0   4.82   7.0  7.26   7.0   6.42   8.0   7.91
               5.0   5.68   5.0  4.74   5.0   5.73   8.0   6.89]
 11×8 Matrix{Float64}:
  10.0   8.04  10.0  9.14  10.0   7.46   8.0   6.58
   8.0   6.95   8.0  8.14   8.0   6.77   8.0   5.76
  13.0   7.58  13.0  8.74  13.0  12.74   8.0   7.71
   9.0   8.81   9.0  8.77   9.0   7.11   8.0   8.84
  11.0   8.33  11.0  9.26  11.0   7.81   8.0   8.47
  14.0   9.96  14.0  8.1   14.0   8.84   8.0   7.04
   6.0   7.24   6.0  6.13   6.0   6.08   8.0   5.25
   4.0   4.26   4.0  3.1    4.0   5.39  19.0  12.5
  12.0  10.84  12.0  9.13  12.0   8.15   8.0   5.56
   7.0   4.82   7.0  7.26   7.0   6.42   8.0   7.91
   5.0   5.68   5.0  4.74   5.0   5.73   8.0   6.89
  
 julia> using Statistics
  

We will perform two tasks using broadcasting: calculation of the mean of every variable and calculation of the coefficient of determination.

We first start with calculating the mean of the columns of the aq matrix. When thinking how to do it we notice that we need to broadcast the mean function over a collection of columns of aq. Fortunately we know that the eachcol function gives us such a collection, therefore we can write:

 
 julia> mean.(eachcol(aq))
 8-element Vector{Float64}:
  9.0
  7.500909090909093
  9.0
  7.500909090909091
  9.0
  7.500000000000001
  9.0
  7.50090909090909
  

Note the dot (.) after mean. If we would forget to write it, we would get:

 
 julia> mean(eachcol(aq))
 11-element Vector{Float64}:
   8.6525
   7.4525
  10.47125
   8.56625
   9.35875
  10.492500000000001
   6.3375
   7.03125
   9.71
   6.92625
   5.755000000000001
  

The result is just a mean of 8 vectors constituting columns of aq matrix, clearly not what we wanted.

As a second application let us rewrite the function calculating the coefficient of determination using broadcasting. Let’s see the implementation

 
 function R²(x, y)
     X = [ones(11) x]
     model = X \ y
     prediction = X * model
     error = y - prediction
     SS_res = sum(v -> v ^ 2, error)
     mean_y = mean(y)
     SS_tot = sum(v -> (v - mean_y) ^ 2, y)
     return 1 - SS_res / SS_tot
 end
  

If we wanted to use broadcasting, we could write:

 
 function R²(x, y)
     X = [ones(11) x]
     model = X \ y
     prediction = X * model
     SS_res = sum((y .- prediction) .^ 2)
     SS_tot = sum((y .- mean(y)) .^ 2)
     return 1 - SS_res / SS_tot
 end
  

As you can see, we have changed formulas for SS_res and SS_tot. In both cases we have used the dot (.) twice. For example, in (y .- prediction) .^ 2 we are broadcasting both subtraction and exponentiation.

Efficiency of broadcasting in Julia

A very important feature of Julia that differentiates it from R and Python is that if it encounters several broadcasting operations in a single expression chained together, it performs such operation in one pass without allocating any intermediate objects. This feature is called broadcast fusion and greatly improves the performance of complex broadcasted operations. Broadcast fusion can be efficient because Julia compiles your program as a whole, so when broadcasting operation is encountered the compiler can fully optimize the native code that gets executed. This is different from R and Python, where support for broadcasted operations are usually implemented in languages like C and stored in precompiled binaries for a limited predefined set of functions.

If you would like to learn more how this feature of the Julia language works, I recommend you start with this blog post https://julialang.org/blog/2018/05/extensible-broadcast-fusion/.

By now you know four ways of iteratively applying operations to elements of collections:

  • using for loops
  • using comprehensions
  • using the map function (and other similar higher-order functions that take functions as their arguments)
  • using broadcasting

You probably ask yourself in which cases you should use which option. Fortunately, this is mostly a convenience and code readability choice. In your projects use the option that is easiest for you to use, and which results in the most readable code. One of the great features of the Julia language is that all these options are fast. This means that most of the time you do not sacrifice performance by choosing one of them over the other. There are exceptions to this rule in some cases. Apart from the differences that I discussed above, one of the most important exceptions is that if you use the for loop or when you use the map function, then using the Threads module or ThreadsX.jl package you can optionally make the operation take advantage of all cores of your processor. Ability to easily support multi-threaded execution of your code is a feature that distinguishes Julia from R and Python.

That’s all for now. Thanks for reading.