 An excerpt from Julia for Data Analysis by Bogumil Kaminski Many languages designed for doing data science provide ways to perform vectorized operations, which is also often called broadcasting. In Julia, broadcasting is also supported. In this article you will see how to use it. Read it if you’re a data scientist or anyone who works with lots of data, and if you’re interested in the Julia language.

Take 25% off Julia for Data Analysis by entering fcckaminski into the discount code box at checkout at manning.com.

We will discuss how broadcasting works by using Anscombe’s quartet data. However, let us start with explaining of broadcasting on some toy examples.

Syntax and meaning of broadcasting in Julia

An important design rule of the Julia language is that it uses the approach that definitions of functions follow the rules of mathematics. The multiplication operator `*` uses matrix multiplication rules. Therefore, the following code follows matrix multiplication rules:

Hundreds of Instagram messages are now so much liked by numerous users and of course they would like to get the chance to download the one they favor. So, below is our instagram video clip download – a wonderful on-line solution; a tool for downloading video clips from Instagram.

```
julia> x = [1 2 3]
1×3 Matrix{Int64}:
1  2  3

julia> y = [1, 2, 3]
3-element Vector{Int64}:
1
2
3

julia> x * y
1-element Vector{Int64}:
14

```

The operation works as we multiply `x` which is bound to a 1×3 matrix and a 3-element vector `y` and in Julia vectors are always interpreted as columnar.

You might ask then how one should multiply two vectors elementwise, an operation known as Hadamard product in mathematics. Clearly this is not possible just with the `*` operator as it does standard matrix multiplication:

```
julia> a = [1, 2, 3]
3-element Vector{Int64}:
1
2
3

julia> b = [4, 5, 6]
3-element Vector{Int64}:
4
5
6

julia> a * b
ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})

```

You get an error as multiplication of a vector by a vector is not a valid mathematical operation. What we need to do is to broadcast the multiplication. In Julia adding broadcasting to an operator is easy. You just prefix it with a dot (`.`) like this:

```
julia> a .* b
3-element Vector{Int64}:
4
10
18

```

When a broadcasted operator like `.*` is used Julia iterates elements of passed collections, in our case vectors `a` and `b`, and applies the operator after the dot, in our case `*`, elementwise. Therefore, in this case broadcasting result is the same as produced by the following operations:

```
julia> map(*, a, b)
3-element Vector{Int64}:
4
10
18

julia> [a[i] * b[i] for i in eachindex(a, b)]
3-element Vector{Int64}:
4
10
18

```

In the `map` example we are passing two collections (instead of only one as we did before when we explained how `map` works) and the passed function, `*` in this case, is applied iteratively elementwise to them until one of the collections gets exhausted.

In the comprehension example it is worth to comment on the `eachindex(a, b)` expression which produces:

```
julia> eachindex(a, b)
Base.OneTo(3)

```

The `eachindex` function produces indices that could be used to index into both `a` and `b` arguments passed to it. In this case these are just integers from 1 to 3. This means that you can index both `a` and `b` vectors with these values; for example, the following indexing expressions are valid: `a`, `b`, `a`, but `a` or `b` would not be valid as they are not in the range specified by `Base.OneTo(3)`.

If sizes of `a` and `b` would not match, we get an error:

```
julia> eachindex([1, 2, 3], [4, 5])
ERROR: DimensionMismatch("all inputs to eachindex must have the same indices, got Base.OneTo(3) and Base.OneTo(2)")

```

This is a very important difference from the `map` function which does not use the `eachindex` function internally, but instead iterates collections until either of them is exhausted as we have explained above:

```
julia> map(*, [1, 2, 3], [4, 5])
2-element Vector{Int64}:
4
10

```

Practical considerations of using map

If you pass to the `map` function multiple collections you should always make sure to check beforehand if they have the same length, as most of the time using collections of unequal length with the `map` function is a bug.

Using broadcasting just like the `eachindex` function checks if the dimensions of the passed objects match:

```
julia> [1, 2, 3] .* [4, 5]
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 3 and 2")

```

Expansion of length-1 dimensions in broadcasting

There is one exception to the rule that dimensions of all collections taking part in broadcasting must match. This additional rule states that single element dimensions get expanded to match the size of the other collection:

```
julia> [1, 2, 3] .^ 
3-element Vector{Int64}:
1
4
9

```

You might ask why single element dimension gets expanded? The reason is practical: in most cases when your collection has a single element in some dimension you want it to get expanded. Here is a first example:

```
julia> [1, 2, 3] .^ 2
3-element Vector{Int64}:
1
4
9

```

Here we are calculating a square of elements of a vector. Since `2` is a scalar, it is interpreted as having size `1` in each dimension. Most people agree that in such a case dimension expansion should happen. You will see the same behavior both in Python and R.

Now let us consider a second example:

```
julia> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10]
10×10 Matrix{Int64}:
1   2   3   4   5   6   7   8   9   10
2   4   6   8  10  12  14  16  18   20
3   6   9  12  15  18  21  24  27   30
4   8  12  16  20  24  28  32  36   40
5  10  15  20  25  30  35  40  45   50
6  12  18  24  30  36  42  48  54   60
7  14  21  28  35  42  49  56  63   70
8  16  24  32  40  48  56  64  72   80
9  18  27  36  45  54  63  72  81   90
10  20  30  40  50  60  70  80  90  100

```

Here we have created a multiplication table. The reason why the specified operation worked is that we had a 10-element vector having 1 column and 10 rows, and a 10-element matrix having 1 row and 10 columns. In this case dimension expansion happened both for the left- and the right-hand side of the operation.

The technique we have shown above is used very often in practice to get a Cartesian product of all inputs. For instance, in part 2 of the book you will learn that in DataFrames.jl when you write `"x" => sum` you ask the package to apply the `sum` function to the column `"x"` of the data frame. A common task is that we want to apply several functions to several columns of a data frame. Using broadcasting this can be written concisely as:

```
julia> ["x", "y", "z"] .=> [sum minimum maximum]
3×3 Matrix{Pair{String, _A} where _A}:
"x"=>sum  "x"=>minimum  "x"=>maximum
"y"=>sum  "y"=>minimum  "y"=>maximum
"z"=>sum  "z"=>minimum  "z"=>maximum

```

Such an expression would ask for computation of `sum`, `minimum`, and `maximum` for columns `"x"`, `"y"`, and `"z"`.

You now know that you can add a dot (`.`) before any operator to broadcast it. How about functions that are not operators? Here you also use a dot (`.`), but this time you suffix it after the function name. Here is an example:

```
julia> abs.([1, -2, 3, -4])
4-element Vector{Int64}:
1
2
3
4

```

Let me stress that just applying the `abs` function to a vector results in an error:

```
julia> abs([1, 2, 3])
ERROR: MethodError: no method matching abs(::Vector{Int64})

```

The reason is the same as before: absolute value is mathematically defined for numbers but is not defined for vectors. Of course, you can conveniently also apply broadcasting to a function taking multiple arguments. For example, the `string` function concatenates its arguments into a single string:

```
julia> string(1, 2, 3)
"123"

```

If we use broadcasting on this function, we get for instance the following result:

```
julia> string.("x", 1:10)
10-element Vector{String}:
"x1"
"x2"
"x3"
"x4"
"x5"
"x6"
"x7"
"x8"
"x9"
"x10"

```

Here we expanded the dimension of the scalar `"x"` to match the length of the `1:10` range. Such an operation is quite common when one wants to automatically generate names for some objects, for example, columns of some data frame or file names in some folder.

It is important to highlight here that prefixing a dot before the operator or suffixing it to a function name is a fully general solution. It is not a hardcoded functionality of some specific predefined operations. You can use broadcasting with any custom function. For instance:

```
julia> f(i::Int) = string("got integer ", i)
f (generic function with 2 methods)

julia> f(s::String) = string("got string ", s)
f (generic function with 2 methods)

julia> f.([1, "1"])
2-element Vector{String}:
"got integer 1"
"got string 1"

```

Here we have defined two methods for the `f` function. As you can see by writing `f.`, we have automatically broadcasted it without having to define anything extra.

Protection of collections from being broadcasted over

Before we go back to our Anscombe’s quartet data let me comment on one very common case. What should one do if one does not want to broadcast some collection but to force its reuse along all dimensions as if it were a scalar? To understand this issue let me introduce the `in` function first.

The `in` function

The `in` function is used to check if some value is contained in some collection. For example,

```
julia> in(1, [1, 2, 3])
true

julia> in(4, [1, 2, 3])
false

```

As a convenience, `in` also supports infix notation

```
julia> 1 in [1, 2, 3]
true

julia> 4 in [1, 2, 3]
false

```

Observe that you already know this infix notation as it is used when defining iterations in `for` loops, see section 2.2 for an explanation how such loops work.

Now imagine I have a long vector of values I want to check if they are contained in some vector. When you try the test without broadcasting it does not work like you probably expected:

```
julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4])
false

```

The problem is that the vector `[1, 3, 5, 7, 9]` is not an element of vector `[1, 2, 3, 4]` so we get `false`. For a reference let us test the scenario when we put the `[1, 3, 5, 7, 9]` vector into the collection in which we look for it:

```
julia> in([1, 3, 5, 7, 9], [1, 2, 3, 4, [1, 3, 5, 7, 9]])
true

```

As expected, this time the `in` test returned `true`.

Going back to the original test note that broadcasting does not seem to work either:

```
julia> in.([1, 3, 5, 7, 9], [1, 2, 3, 4])
ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 5 and 4")

```

How should we resolve this issue? The solution is to wrap the vector that we want to be expanded with `Ref`. In this way we will protect this object from being iterated over. Instead, it will be unwrapped from `Ref` and treated by broadcasting as if it were a scalar:

```
julia> in.([1, 3, 5, 7, 9], Ref([1, 2, 3, 4]))
5-element BitVector:
1
1
0
0
0

```

This time we got the expected result.

What is `Ref` in Julia?

In Julia when you write `r = Ref(x)` you create a 0-dimensional container storing the `x` value as its only element. You can retrieve the `x` object from the `Ref` value `r` by writing `r[]` (notice that we do not pass any indices in the indexing syntax, as the `r` object is 0-dimensional). The type is named `Ref` as you can think of it that `r` is a reference to `x`.

Since `Ref` objects are 0-dimensional and store exactly one element they have length 1 in every dimension. In consequence if you use the `r` object in broadcasting the `x` value stored in it is used in all required dimensions following expansion rules.

In the output note that Boolean `true` is printed as 1 and Boolean `false` is printed as 0. This choice of display was made to allow for more convenient visual inspection of large matrices containing Boolean values. To see why this is useful consider if we wanted to check which entries in a multiplication table are odd using the `isodd` function:

```
julia> isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .* [1 2 3 4 5 6 7 8 9 10])
10×10 BitMatrix:
1  0  1  0  1  0  1  0  1  0
0  0  0  0  0  0  0  0  0  0
1  0  1  0  1  0  1  0  1  0
0  0  0  0  0  0  0  0  0  0
1  0  1  0  1  0  1  0  1  0
0  0  0  0  0  0  0  0  0  0
1  0  1  0  1  0  1  0  1  0
0  0  0  0  0  0  0  0  0  0
1  0  1  0  1  0  1  0  1  0
0  0  0  0  0  0  0  0  0  0

```

In this example I have additionally shown you that broadcasting operations can be chained together in a single expression. In this case we broadcasted both the multiplication `*` and the `isodd` function.

For a reference let me show you how this matrix would be displayed if we changed its element type to `Any`:

```
julia> Matrix{Any}(isodd.([1, 2, 3, 4, 5, 6, 7, 8, 9, 10] .*
[1 2 3 4 5 6 7 8 9 10]))
10×10 Matrix{Any}:
true  false   true  false   true  false   true  false   true  false
false  false  false  false  false  false  false  false  false  false
true  false   true  false   true  false   true  false   true  false
false  false  false  false  false  false  false  false  false  false
true  false   true  false   true  false   true  false   true  false
false  false  false  false  false  false  false  false  false  false
true  false   true  false   true  false   true  false   true  false
false  false  false  false  false  false  false  false  false  false
true  false   true  false   true  false   true  false   true  false
false  false  false  false  false  false  false  false  false  false

```

This time `true` and `false` is printed to avoid potential confusion with integer 1 and 0 that could be potentially stored in this matrix. In my opinion, however, analyzing such a printout is less convenient than before.

To practice what we have learned try the following exercise that is a quite common task when processing data.

Exercise 1 The `parse` function can be used to convert a string into a number. For instance, if you want to parse a string as an integer write `parse(Int, "10")` to get an integer `10`. Assume you are given a vector of strings `["1", "2", "3"]`. Your task is to create a vector of integers by parsing the strings contained in the given vector.

Analyzing Anscombe’s quartet data using broadcasting

Now we are ready to load up our Anscombe’s quartet data. Let us initialize the `aq` variable first:

```
julia> aq = [10.0   8.04  10.0  9.14  10.0   7.46   8.0   6.58
8.0   6.95   8.0  8.14   8.0   6.77   8.0   5.76
13.0   7.58  13.0  8.74  13.0  12.74   8.0   7.71
9.0   8.81   9.0  8.77   9.0   7.11   8.0   8.84
11.0   8.33  11.0  9.26  11.0   7.81   8.0   8.47
14.0   9.96  14.0  8.1   14.0   8.84   8.0   7.04
6.0   7.24   6.0  6.13   6.0   6.08   8.0   5.25
4.0   4.26   4.0  3.1    4.0   5.39  19.0  12.50
12.0  10.84  12.0  9.13  12.0   8.15   8.0   5.56
7.0   4.82   7.0  7.26   7.0   6.42   8.0   7.91
5.0   5.68   5.0  4.74   5.0   5.73   8.0   6.89]
11×8 Matrix{Float64}:
10.0   8.04  10.0  9.14  10.0   7.46   8.0   6.58
8.0   6.95   8.0  8.14   8.0   6.77   8.0   5.76
13.0   7.58  13.0  8.74  13.0  12.74   8.0   7.71
9.0   8.81   9.0  8.77   9.0   7.11   8.0   8.84
11.0   8.33  11.0  9.26  11.0   7.81   8.0   8.47
14.0   9.96  14.0  8.1   14.0   8.84   8.0   7.04
6.0   7.24   6.0  6.13   6.0   6.08   8.0   5.25
4.0   4.26   4.0  3.1    4.0   5.39  19.0  12.5
12.0  10.84  12.0  9.13  12.0   8.15   8.0   5.56
7.0   4.82   7.0  7.26   7.0   6.42   8.0   7.91
5.0   5.68   5.0  4.74   5.0   5.73   8.0   6.89

julia> using Statistics

```

We will perform two tasks using broadcasting: calculation of the mean of every variable and calculation of the coefficient of determination.

We first start with calculating the mean of the columns of the `aq` matrix. When thinking how to do it we notice that we need to broadcast the `mean` function over a collection of columns of `aq`. Fortunately we know that the `eachcol` function gives us such a collection, therefore we can write:

```
julia> mean.(eachcol(aq))
8-element Vector{Float64}:
9.0
7.500909090909093
9.0
7.500909090909091
9.0
7.500000000000001
9.0
7.50090909090909

```

Note the dot (`.`) after `mean`. If we would forget to write it, we would get:

```
julia> mean(eachcol(aq))
11-element Vector{Float64}:
8.6525
7.4525
10.47125
8.56625
9.35875
10.492500000000001
6.3375
7.03125
9.71
6.92625
5.755000000000001

```

The result is just a mean of 8 vectors constituting columns of `aq` matrix, clearly not what we wanted.

As a second application let us rewrite the function calculating the coefficient of determination using broadcasting. Let’s see the implementation

```
function R²(x, y)
X = [ones(11) x]
model = X \ y
prediction = X * model
error = y - prediction
SS_res = sum(v -> v ^ 2, error)
mean_y = mean(y)
SS_tot = sum(v -> (v - mean_y) ^ 2, y)
return 1 - SS_res / SS_tot
end

```

If we wanted to use broadcasting, we could write:

```
function R²(x, y)
X = [ones(11) x]
model = X \ y
prediction = X * model
SS_res = sum((y .- prediction) .^ 2)
SS_tot = sum((y .- mean(y)) .^ 2)
return 1 - SS_res / SS_tot
end

```

As you can see, we have changed formulas for `SS_res` and `SS_tot`. In both cases we have used the dot (`.`) twice. For example, in `(y .- prediction) .^ 2` we are broadcasting both subtraction and exponentiation.

A very important feature of Julia that differentiates it from R and Python is that if it encounters several broadcasting operations in a single expression chained together, it performs such operation in one pass without allocating any intermediate objects. This feature is called broadcast fusion and greatly improves the performance of complex broadcasted operations. Broadcast fusion can be efficient because Julia compiles your program as a whole, so when broadcasting operation is encountered the compiler can fully optimize the native code that gets executed. This is different from R and Python, where support for broadcasted operations are usually implemented in languages like C and stored in precompiled binaries for a limited predefined set of functions.

• using `for` loops
• using the `map` function (and other similar higher-order functions that take functions as their arguments)
You probably ask yourself in which cases you should use which option. Fortunately, this is mostly a convenience and code readability choice. In your projects use the option that is easiest for you to use, and which results in the most readable code. One of the great features of the Julia language is that all these options are fast. This means that most of the time you do not sacrifice performance by choosing one of them over the other. There are exceptions to this rule in some cases. Apart from the differences that I discussed above, one of the most important exceptions is that if you use the `for` loop or when you use the `map` function, then using the `Threads` module or ThreadsX.jl package you can optionally make the operation take advantage of all cores of your processor. Ability to easily support multi-threaded execution of your code is a feature that distinguishes Julia from R and Python.