|
An excerpt from Julia for Data Analysis by Bogumil Kaminski Many languages designed for doing data science provide ways to perform vectorized operations, which is also often called broadcasting. In Julia, broadcasting is also supported. In this article you will see how to use it. Read it if you’re a data scientist or anyone who works with lots of data, and if you’re interested in the Julia language. |
Take 25% off Julia for Data Analysis by entering fcckaminski into the discount code box at checkout at manning.com.
We will discuss how broadcasting works by using Anscombe’s quartet data. However, let us start with explaining of broadcasting on some toy examples.
Syntax and meaning of broadcasting in Julia
An important design rule of the Julia language is that it uses the approach that definitions of functions follow the rules of mathematics. The multiplication operator *
uses matrix multiplication rules. Therefore, the following code follows matrix multiplication rules:
julia> x = [1 2 3] 1×3 Matrix{Int64}: 1 2 3 julia> y = [1, 2, 3] 3-element Vector{Int64}: 1 2 3 julia> x * y 1-element Vector{Int64}: 14
The operation works as we multiply x
which is bound to a 1×3 matrix and a 3-element vector y
and in Julia vectors are always interpreted as columnar.
You might ask then how one should multiply two vectors elementwise, an operation known as Hadamard product in mathematics. Clearly this is not possible just with the *
operator as it does standard matrix multiplication:
julia> a = [1, 2, 3] 3-element Vector{Int64}: 1 2 3 julia> b = [4, 5, 6] 3-element Vector{Int64}: 4 5 6 julia> a * b ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})
You get an error as multiplication of a vector by a vector is not a valid mathematical operation. What we need to do is to broadcast the multiplication. In Julia adding broadcasting to an operator is easy. You just prefix it with a dot (.
) like this:
julia> a .* b 3-element Vector{Int64}: 4 10 18
When a broadcasted operator like .*
is used Julia iterates elements of passed collections, in our case vectors a
and b
, and applies the operator after the dot, in our case *
, elementwise. Therefore, in this case broadcasting result is the same as produced by the following operations:
julia> map(*, a, b) 3-element Vector{Int64}: 4 10 18 julia> [a[i] * b[i] for i in eachindex(a, b)] 3-element Vector{Int64}: 4 10 18
In the map
example we are passing two collections (instead of only one as we did before when we explained how map
works) and the passed function, *
in this case, is applied iteratively elementwise to them until one of the collections gets exhausted.
In the comprehension example it is worth to comment on the eachindex(a, b)
expression which produces:
julia> eachindex(a, b) Base.OneTo(3)
The eachindex
function produces indices that could be used to index into both a
and b
arguments passed to it. In this case these are just integers from 1 to 3. This means that you can index both a
and b
vectors with these values; for example, the following indexing expressions are valid: a[1]
, b[2]
, a[3]
, but a[0]
or b[4]
would not be valid as they are not in the range specified by Base.OneTo(3)
.
If sizes of a
and b
would not match, we get an error:
julia> eachindex([1, 2, 3], [4, 5]) ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)")
This is a very important difference from the map
function which does not use the eachindex
function internally, but instead iterates collections until either of them is exhausted as we have explained above:
julia> map(*, [1, 2, 3], [4, 5]) 2-element Vector{Int64}: 4 10
Using broadcasting just like the eachindex
function checks if the dimensions of the passed objects match:
julia> [1, 2, 3] .* [4, 5] ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 2")
Expansion of length-1 dimensions in broadcasting
There is one exception to the rule that dimensions of all collections taking part in broadcasting must match. This additional rule states that single element dimensions get expanded to match the size of the other collection:
julia> [1, 2, 3] .^ [2] 3-element Vector{Int64}: 1 4 9
You might ask why single element dimension gets expanded? The reason is practical: in most cases when your collection has a single element in some dimension you want it to get expanded. Here is a first example:
julia> [1, 2, 3] .^ 2 3-element Vector{Int64}: 1 4 9
Here we are calculating a square of elements of a vector. Since 2
is a scalar, it is interpreted as having size 1
in each dimension. Most people agree that in such a case dimension expansion should happen. You will see the same behavior both in Python and R.
Now let us consider a second example:
julia> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10] 10×10 Matrix{Int64}: 1 2 3 4 5 6 7 8 9 10 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20 24 28 32 36 40 5 10 15 20 25 30 35 40 45 50 6 12 18 24 30 36 42 48 54 60 7 14 21 28 35 42 49 56 63 70 8 16 24 32 40 48 56 64 72 80 9 18 27 36 45 54 63 72 81 90 10 20 30 40 50 60 70 80 90 100
Here we have created a multiplication table. The reason why the specified operation worked is that we had a 10-element vector having 1 column and 10 rows, and a 10-element matrix having 1 row and 10 columns. In this case dimension expansion happened both for the left- and the right-hand side of the operation.
The technique we have shown above is used very often in practice to get a Cartesian product of all inputs. For instance, in part 2 of the book you will learn that in DataFrames.jl when you write "x" => sum
you ask the package to apply the sum
function to the column "x"
of the data frame. A common task is that we want to apply several functions to several columns of a data frame. Using broadcasting this can be written concisely as:
julia> ["x", "y", "z"] .=> [sum minimum maximum] 3×3 Matrix{Pair{String, _A} where _A}: "x"=>sum "x"=>minimum "x"=>maximum "y"=>sum "y"=>minimum "y"=>maximum "z"=>sum "z"=>minimum "z"=>maximum
Such an expression would ask for computation of sum
, minimum
, and maximum
for columns "x"
, "y"
, and "z"
.
You now know that you can add a dot (.
) before any operator to broadcast it. How about functions that are not operators? Here you also use a dot (.
), but this time you suffix it after the function name. Here is an example:
julia> abs.([1, -2, 3, -4]) 4-element Vector{Int64}: 1 2 3 4
Let me stress that just applying the abs
function to a vector results in an error:
julia> abs([1, 2, 3]) ERROR: MethodError: no method matching abs(::Vector{Int64})
The reason is the same as before: absolute value is mathematically defined for numbers but is not defined for vectors. Of course, you can conveniently also apply broadcasting to a function taking multiple arguments. For example, the string
function concatenates its arguments into a single string:
julia> string(1, 2, 3) "123"
If we use broadcasting on this function, we get for instance the following result:
julia> string.("x", 1:10) 10-element Vector{String}: "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
Here we expanded the dimension of the scalar "x"
to match the length of the 1:10
range. Such an operation is quite common when one wants to automatically generate names for some objects, for example, columns of some data frame or file names in some folder.
It is important to highlight here that prefixing a dot before the operator or suffixing it to a function name is a fully general solution. It is not a hardcoded functionality of some specific predefined operations. You can use broadcasting with any custom function. For instance:
julia> f(i::Int) = string("got integer ", i) f (generic function with 2 methods) julia> f(s::String) = string("got string ", s) f (generic function with 2 methods) julia> f.([1, "1"]) 2-element Vector{String}: "got integer 1" "got string 1"
Here we have defined two methods for the f
function. As you can see by writing f.
, we have automatically broadcasted it without having to define anything extra.
Protection of collections from being broadcasted over
Before we go back to our Anscombe’s quartet data let me comment on one very common case. What should one do if one does not want to broadcast some collection but to force its reuse along all dimensions as if it were a scalar? To understand this issue let me introduce the in
function first.
julia> in(1, [1, 2, 3]) true julia> in(4, [1, 2, 3]) false
julia> 1 in [1, 2, 3] true julia> 4 in [1, 2, 3] false
Now imagine I have a long vector of values I want to check if they are contained in some vector. When you try the test without broadcasting it does not work like you probably expected:
julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4]) false
The problem is that the vector [1, 3, 5, 7, 9]
is not an element of vector [1, 2, 3, 4]
so we get false
. For a reference let us test the scenario when we put the [1, 3, 5, 7, 9]
vector into the collection in which we look for it:
julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4, [1, 3, 5, 7, 9]]) true
As expected, this time the in
test returned true
.
Going back to the original test note that broadcasting does not seem to work either:
julia> in.([1, 3, 5, 7, 9], [1, 2, 3, 4]) ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 4")
How should we resolve this issue? The solution is to wrap the vector that we want to be expanded with Ref
. In this way we will protect this object from being iterated over. Instead, it will be unwrapped from Ref
and treated by broadcasting as if it were a scalar:
julia> in.([1, 3, 5, 7, 9], Ref([1, 2, 3, 4])) 5-element BitVector: 1 1 0 0 0
This time we got the expected result.
In the output note that Boolean true
is printed as 1 and Boolean false
is printed as 0. This choice of display was made to allow for more convenient visual inspection of large matrices containing Boolean values. To see why this is useful consider if we wanted to check which entries in a multiplication table are odd using the isodd
function:
julia> isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10]) 10×10 BitMatrix: 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0
In this example I have additionally shown you that broadcasting operations can be chained together in a single expression. In this case we broadcasted both the multiplication *
and the isodd
function.
For a reference let me show you how this matrix would be displayed if we changed its element type to Any
:
julia> Matrix{Any}(isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10])) 10×10 Matrix{Any}: true false true false true false true false true false false false false false false false false false false false true false true false true false true false true false false false false false false false false false false false true false true false true false true false true false false false false false false false false false false false true false true false true false true false true false false false false false false false false false false false true false true false true false true false true false false false false false false false false false false false
This time true
and false
is printed to avoid potential confusion with integer 1 and 0 that could be potentially stored in this matrix. In my opinion, however, analyzing such a printout is less convenient than before.
To practice what we have learned try the following exercise that is a quite common task when processing data.
Exercise 1 The parse
function can be used to convert a string into a number. For instance, if you want to parse a string as an integer write parse(Int, "10")
to get an integer 10
. Assume you are given a vector of strings ["1", "2", "3"]
. Your task is to create a vector of integers by parsing the strings contained in the given vector.
Analyzing Anscombe’s quartet data using broadcasting
Now we are ready to load up our Anscombe’s quartet data. Let us initialize the aq
variable first:
julia> aq = [10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58 8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76 13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71 9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84 11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47 14.0 9.96 14.0 8.1 14.0 8.84 8.0 7.04 6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25 4.0 4.26 4.0 3.1 4.0 5.39 19.0 12.50 12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56 7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91 5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89] 11×8 Matrix{Float64}: 10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58 8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76 13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71 9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84 11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47 14.0 9.96 14.0 8.1 14.0 8.84 8.0 7.04 6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25 4.0 4.26 4.0 3.1 4.0 5.39 19.0 12.5 12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56 7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91 5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89 julia> using Statistics
We will perform two tasks using broadcasting: calculation of the mean of every variable and calculation of the coefficient of determination.
We first start with calculating the mean of the columns of the aq
matrix. When thinking how to do it we notice that we need to broadcast the mean
function over a collection of columns of aq
. Fortunately we know that the eachcol
function gives us such a collection, therefore we can write:
julia> mean.(eachcol(aq)) 8-element Vector{Float64}: 9.0 7.500909090909093 9.0 7.500909090909091 9.0 7.500000000000001 9.0 7.50090909090909
Note the dot (.
) after mean
. If we would forget to write it, we would get:
julia> mean(eachcol(aq)) 11-element Vector{Float64}: 8.6525 7.4525 10.47125 8.56625 9.35875 10.492500000000001 6.3375 7.03125 9.71 6.92625 5.755000000000001
The result is just a mean of 8 vectors constituting columns of aq
matrix, clearly not what we wanted.
As a second application let us rewrite the function calculating the coefficient of determination using broadcasting. Let’s see the implementation
function R²(x, y) X = [ones(11) x] model = X \ y prediction = X * model error = y - prediction SS_res = sum(v -> v ^ 2, error) mean_y = mean(y) SS_tot = sum(v -> (v - mean_y) ^ 2, y) return 1 - SS_res / SS_tot end
If we wanted to use broadcasting, we could write:
function R²(x, y) X = [ones(11) x] model = X \ y prediction = X * model SS_res = sum((y .- prediction) .^ 2) SS_tot = sum((y .- mean(y)) .^ 2) return 1 - SS_res / SS_tot end
As you can see, we have changed formulas for SS_res
and SS_tot
. In both cases we have used the dot (.
) twice. For example, in (y .- prediction) .^ 2
we are broadcasting both subtraction and exponentiation.
By now you know four ways of iteratively applying operations to elements of collections:
- using
for
loops - using comprehensions
- using the
map
function (and other similar higher-order functions that take functions as their arguments) - using broadcasting
You probably ask yourself in which cases you should use which option. Fortunately, this is mostly a convenience and code readability choice. In your projects use the option that is easiest for you to use, and which results in the most readable code. One of the great features of the Julia language is that all these options are fast. This means that most of the time you do not sacrifice performance by choosing one of them over the other. There are exceptions to this rule in some cases. Apart from the differences that I discussed above, one of the most important exceptions is that if you use the for
loop or when you use the map
function, then using the Threads
module or ThreadsX.jl package you can optionally make the operation take advantage of all cores of your processor. Ability to easily support multi-threaded execution of your code is a feature that distinguishes Julia from R and Python.
That’s all for now. Thanks for reading.