An excerpt from Julia for Data Analysis by Bogumil Kaminski
This article shows you how to use multiple dispatch in Julia. Read it if you’re a data scientist or anyone who works with lots of data, and if you’re interested in the Julia language.
Take 25% off Julia for Data Analysis by entering fcckaminski into the discount code box at checkout at manning.com.
Let’s learn how to define functions that have different methods and apply this knowledge to a function called winsorized_mean.
Rules of defining methods for a function
Fortunately, defining methods is relatively easy if you understand the principles of how Julia’s type system works. You just add the type restriction to the arguments of the function after ::. If the type specification part is omitted, Julia assumes that a value of type Any is allowed.
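As a small sketch of the last rule, the two functions below differ only in whether the argument is annotated. The names describe_untyped and describe_any are hypothetical and exist only for this illustration.

```julia
# Hypothetical helper names, used only for this illustration.
describe_untyped(x) = "got a value of type $(typeof(x))"
describe_any(x::Any) = "got a value of type $(typeof(x))"

# Both functions accept any value, because an omitted annotation means Any.
println(describe_untyped(1))
println(describe_any("text"))

# hasmethod checks whether a method matching the given argument types exists.
println(hasmethod(describe_untyped, Tuple{Any}))
println(hasmethod(describe_any, Tuple{Any}))
```

Both hasmethod calls should report true, since leaving the annotation out is equivalent to writing ::Any.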
Assume we want to create the function fun taking a single positional argument with the following behavior:
- if it is passed a number, it should print "a number was passed", unless it is a value of type Float64, in which case we want "a Float64 value" printed;
- in all other cases, we want to print "unsupported type".
Here is an example of how you can implement this behavior by defining three methods for the function fun.
julia> fun(x) = println("unsupported type")
fun (generic function with 1 method)

julia> fun(x::Number) = println("a number was passed")
fun (generic function with 2 methods)

julia> fun(x::Float64) = println("a Float64 value")
fun (generic function with 3 methods)

julia> methods(fun)
# 3 methods for generic function "fun":
[1] fun(x::Float64) in Main at REPL[3]:1
[2] fun(x::Number) in Main at REPL[2]:1
[3] fun(x) in Main at REPL[1]:1

julia> fun("hello!")
unsupported type

julia> fun(1)
a number was passed

julia> fun(1.0)
a Float64 value
In the example above, note that, for instance, 1 is a Number (as it is an Int) but it is not a Float64, so the most specific matching method is fun(x::Number).
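If you want to check which method Julia would select without calling the function, one option is the which function from Base. The sketch below repeats the three definitions of fun so that it runs on its own.

```julia
fun(x) = println("unsupported type")
fun(x::Number) = println("a number was passed")
fun(x::Float64) = println("a Float64 value")

# Int is a subtype of Number, but it is not a subtype of Float64.
println(Int <: Number)
println(Int <: Float64)

# which reports the method that would be called for the given argument types.
println(which(fun, (Int,)))
println(which(fun, (Float64,)))
println(which(fun, (String,)))
```

The three which calls should report the fun(x::Number), fun(x::Float64), and fun(x) methods respectively, matching the output of the calls shown earlier.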
Method ambiguity problem
What you must keep in mind when defining multiple methods for a function is to avoid method ambiguities. They happen when the Julia compiler is not able to decide which method for a given set of arguments should be selected. It is easiest to understand the problem by example. Assume you want to define a bar function taking two positional arguments. It should inform you if any of them were numbers. Here is a first attempt to implement such a function:
julia> bar(x, y) = "no numbers passed"
bar (generic function with 1 method)

julia> bar(x::Number, y) = "first argument is a number"
bar (generic function with 2 methods)

julia> bar(x, y::Number) = "second argument is a number"
bar (generic function with 3 methods)

julia> bar("hello", "world")
"no numbers passed"

julia> bar(1, "world")
"first argument is a number"

julia> bar("hello", 2)
"second argument is a number"

julia> bar(1, 2)
ERROR: MethodError: bar(::Int64, ::Int64) is ambiguous. Candidates:
  bar(x::Number, y) in Main at REPL[2]:1
  bar(x, y::Number) in Main at REPL[3]:1
Possible fix, define
  bar(::Number, ::Number)
As you can see, all worked nicely until we wanted to call bar by passing a number as both its first and second argument. In this case Julia complained that it does not know which method should be called, as two of them could potentially be selected. Fortunately, we got a hint on how to resolve the situation. We need to define an additional method that fixes the ambiguity:
julia> bar(x::Number, y::Number) = "both arguments are numbers"
bar (generic function with 4 methods)

julia> bar(1, 2)
"both arguments are numbers"

julia> methods(bar)
# 4 methods for generic function "bar":
[1] bar(x::Number, y::Number) in Main at REPL[8]:1
[2] bar(x::Number, y) in Main at REPL[2]:1
[3] bar(x, y::Number) in Main at REPL[3]:1
[4] bar(x, y) in Main at REPL[1]:1
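Ambiguities like this one can also be found before any call fails. The sketch below assumes the detect_ambiguities function from the Test standard library and wraps the definitions in two hypothetical modules, AmbiguousDemo and FixedDemo, so that the versions with and without the extra method can be compared side by side.

```julia
using Test  # standard library; provides detect_ambiguities

# Hypothetical module reproducing the three original definitions.
module AmbiguousDemo
bar(x, y) = "no numbers passed"
bar(x::Number, y) = "first argument is a number"
bar(x, y::Number) = "second argument is a number"
end

# Hypothetical module that adds the method resolving the ambiguity.
module FixedDemo
bar(x, y) = "no numbers passed"
bar(x::Number, y) = "first argument is a number"
bar(x, y::Number) = "second argument is a number"
bar(x::Number, y::Number) = "both arguments are numbers"
end

# detect_ambiguities returns pairs of methods that are ambiguous with each other.
println(detect_ambiguities(AmbiguousDemo))
println(isempty(detect_ambiguities(FixedDemo)))
println(FixedDemo.bar(1, 2))
```

The first call should list the bar(x::Number, y) and bar(x, y::Number) pair, while the second module should report no ambiguities.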
Improved implementation of winsorized mean
We are ready to improve our winsorized_mean function definition. Here is how you could implement it:
julia> function winsorized_mean(x::AbstractVector, k::Integer)
           k >= 0 || throw(ArgumentError("k must be non-negative"))
           length(x) > 2 * k || throw(ArgumentError("k is too large"))
           y = sort!(collect(x))
           for i in 1:k
               y[i] = y[k + 1]
               y[end - i + 1] = y[end - k]
           end
           return sum(y) / length(y)
       end
winsorized_mean (generic function with 1 method)
First, note that we have restricted the allowed types for x and k, so when you invoke the function its arguments must match the required types:
julia> winsorized_mean([8, 3, 1, 5, 7], 1)
5.0

julia> winsorized_mean(1:10, 2)
5.5

julia> winsorized_mean(1:10, "a")
ERROR: MethodError: no method matching winsorized_mean(::UnitRange{Int64}, ::String)
Closest candidates are:
  winsorized_mean(::AbstractVector{T} where T, ::Integer) at REPL[6]:1

julia> winsorized_mean(10, 1)
ERROR: MethodError: no method matching winsorized_mean(::Int64, ::Int64)
Closest candidates are:
  winsorized_mean(::AbstractVector{T} where T, ::Integer) at REPL[6]:1
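To see where the value 5.0 in the first call comes from, the following sketch runs the body of the function step by step at the top level for x = [8, 3, 1, 5, 7] and k = 1. It only traces the same statements; it is not a separate implementation.

```julia
x = [8, 3, 1, 5, 7]
k = 1

y = sort!(collect(x))            # sorted copy: [1, 3, 5, 7, 8]
for i in 1:k
    y[i] = y[k + 1]              # overwrite the k smallest values
    y[end - i + 1] = y[end - k]  # overwrite the k largest values
end

println(y)                       # [3, 3, 5, 7, 7]
println(sum(y) / length(y))      # 5.0
```

The smallest value 1 is replaced by 3 and the largest value 8 is replaced by 7, so the mean is 25 / 5 = 5.0.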
Additionally, we can see several things in the code that make it robust. First, we check whether the passed arguments are consistent: if k is negative or too large, it is invalid, in which case we throw an error by calling the throw function with an ArgumentError as its argument. See what happens if we pass the wrong k:
julia> winsorized_mean(1:10, -1)
ERROR: ArgumentError: k must be non-negative

julia> winsorized_mean(1:10, 5)
ERROR: ArgumentError: k is too large
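The checks rely on the short-circuit behavior of ||: the right-hand side is evaluated only when the left-hand side is false. The sketch below isolates the two checks in a hypothetical helper named check_k (not part of the original code) and shows how the exception can be caught and inspected.

```julia
# check_k is a hypothetical helper that isolates the two argument checks;
# n plays the role of length(x).
function check_k(n::Integer, k::Integer)
    # throw is reached only when the condition on the left is false.
    k >= 0 || throw(ArgumentError("k must be non-negative"))
    n > 2 * k || throw(ArgumentError("k is too large"))
    return true
end

println(check_k(10, 2))

# Catch the exception to inspect it instead of stopping the program.
try
    check_k(10, 5)
catch err
    println(err isa ArgumentError)
    println(err.msg)
end
```

With n = 10 and k = 5 the second condition fails, so the caught error should carry the message "k is too large".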
Next, we make a copy of the data stored in the x vector before sorting it. To achieve this, we use the collect function, which takes any iterable collection and returns an object of type Vector storing the same values. We pass this vector to the sort! function to sort it in place.
You might ask why we need the collect function to allocate a new Vector. The reason is that, for example, ranges like 1:10 are read-only, so later we would not be able to update y with y[i] = y[k + 1] and y[end - i + 1] = y[end - k]. Additionally, Julia in general supports arrays with non-1-based indexing (see https://github.com/JuliaArrays/OffsetArrays.jl), whereas Vector uses 1-based indexing. In summary, the collect function turns any collection or general AbstractVector into a standard Julia Vector that is mutable and uses 1-based indexing.
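A short sketch of the difference between a range and the Vector that collect produces is shown below. The exact exception raised when writing to a range differs between Julia versions, so the code catches any exception rather than naming one.

```julia
r = 1:5

# Writing to a range fails because ranges are read-only.
write_failed = try
    r[1] = 100
    false
catch
    true
end
println(write_failed)

# collect allocates a new, mutable Vector holding the same values.
y = collect(r)
println(y isa Vector)
y[1] = 100
println(y)
println(r)               # the original range is unchanged
println(firstindex(y))   # Vector indexing starts at 1
```

Because y is a separate object, updating it leaves the value passed by the caller untouched, which is why winsorized_mean can safely sort and overwrite it.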
Finally, note that instead of performing the for loop manually we have just used the sum function, which is both simpler and more robust.
What is the benefit of having this specialized method? The answer is that the second method is defined as sort(r::AbstractUnitRange) = r. We know that objects of type AbstractUnitRange are already sorted (they are ranges of values with an increment equal to 1), so we can just return the passed value. In this case, taking advantage of the type restriction in the method signature can improve the performance of the sort operation significantly.
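You can observe this behavior directly. The sketch below assumes a Julia version in which Base defines the sort method for AbstractUnitRange quoted above, and compares it with sorting an ordinary Vector.

```julia
r = 1:10

# Ask Julia which sort method is selected for a UnitRange argument.
println(which(sort, (typeof(r),)))

# For a range, sort returns the range itself, so no sorting work is done.
println(sort(r) === r)

# For a Vector, sort returns a new, sorted copy.
v = collect(r)
println(sort(v) !== v)
println(sort(v) == v)
```

The which call should point at a method defined for AbstractUnitRange, and the remaining lines should all print true.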
That’s all for now. Thanks for reading.