By Chris von Csefalvay

In this article, excerpted from Learn Julia, I talk about strings in Julia, including their constituent elements.

 

Nobody knows for certain why strings are called strings; the usual theory is that the name reflects the fact that strings are composed of a number of individual characters “strung together.” What is clear beyond a doubt, however, is that strings are a very important primitive data type. From text mining to accounting, strings get involved any time alphanumeric information needs to be stored. Anything that can be represented as a number of Unicode characters can be represented as a string.

Before we can understand strings, we therefore need to understand their constituent elements: characters, represented in Julia by the Char type. A Char is, essentially, a 32-bit bits type, with the added flair that its numeric value is interpreted as a code point in the Unicode code table.

One of the consequences of this is the (initially unusual) result of adding an integer to a character object:

julia> 'a' + 1 #A
'b' #B
julia> typeof(ans) #C
Char #D

#A The single quotation mark constructs a Char type with the value ‘a’.

#B Adding 1 to ‘a’ yields the character with the code point 1 larger than that of ‘a’ – namely, ‘b’.

#C Using typeof(), we are examining the type of the previous answer (represented by the variable ans, which is bound to the last answer value).

#D The type of the result will also be a Char.
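The relationship between a Char and its code point can also be made explicit by converting in both directions. The following is a minimal sketch, assuming a reasonably recent Julia session; the variable names are illustrative only.

```julia
# Convert a Char to its integer code point and back again.
code_a    = Int('a')          # the code point of 'a' as an integer
next_char = Char(code_a + 1)  # build a Char from the next code point
via_sum   = 'a' + 1           # the same result using Char arithmetic
distance  = 'd' - 'a'         # subtracting two Chars yields an integer

println(code_a)     # 97
println(next_char)  # b
println(distance)   # 3
```

Subtracting one Char from another gives the distance between their code points as an integer, which is what makes the addition above well defined.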

Let’s create a Char! In the following, I am using two different ways to create a Char literal:

julia> char1 = Char('F')
'F'
julia> char2 = Char(255)
'ÿ'

Unlike many other programming languages, Julia draws an important distinction between single and double quotes. A double quote denotes a string, but a single quotation mark denotes a Char object. This is a frequent pitfall for programmers coming from languages that do not make this distinction: entering a string literal in single quotes yields the notorious invalid character literal error! In the above example, I have used the character literal form and single quotes to create a Char object with the value corresponding to the character F. Note that the value of a character literal is not a character but a number, even if Julia is keen on displaying the literal as a character!

For char2, I have constructed the character from its decimal value. Commonly, when it comes to Unicode, values are given not as decimal numbers but as hexadecimal ones. Julia allows you to enter hexadecimal Char references in two ways:

  • using the Char constructor function and entering the hexadecimal value as you would any hexadecimal number (e.g. 0xff), or
  • entering the character as a Unicode literal: \u followed by up to four hexadecimal digits, or \U followed by up to eight. As this is a character literal format, you have to enclose it in single quotation marks.
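The different ways of entering the same character can be compared directly. This is a small sketch using ÿ (code point 255, hexadecimal ff) as the example; the variable names are made up for illustration.

```julia
# Four ways of writing the character with code point 255 (hex ff).
from_decimal = Char(255)      # constructor with a decimal value
from_hex     = Char(0xff)     # constructor with a hexadecimal value
from_short   = '\u00ff'       # \u escape with four hexadecimal digits
from_long    = '\U000000ff'   # \U escape with eight hexadecimal digits

all_equal = from_decimal == from_hex == from_short == from_long
println(all_equal)  # true
```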

As we have seen above, Julia allows for comparisons and some arithmetic with Char objects, treating them as the integer values corresponding to their character representation.

Sometimes, this is useful, e.g., when trying to simulate the Caesar cipher:

julia> ('C'+1),('a'+1),('e'+1),('s'+1),('a'+1),('r'+1) 
       ('D','b','f','t','b','s')
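The same shift can be applied to a whole string at once. The sketch below uses a hypothetical helper, shift_chars, which is not part of Julia. It performs no wrap-around at the end of the alphabet, so it mirrors the tuple example above rather than implementing a complete Caesar cipher.

```julia
# Hypothetical helper: shift every Char of a string by k code points.
# There is no wrap-around at 'z' or 'Z', so this is only a partial
# simulation of the Caesar cipher.
shift_chars(s::AbstractString, k::Integer) = map(c -> c + k, s)

shifted  = shift_chars("Caesar", 1)
restored = shift_chars(shifted, -1)

println(shifted)   # Dbftbs
println(restored)  # Caesar
```

Shifting by a negative offset undoes the operation, because Char arithmetic is plain integer arithmetic on code points.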

It’s crucial to understand that Char objects are very different from strings of length 1. While we don’t (yet) know much about the latter, suffice it to say that while a string holds a sequence of encoded characters, a Char object is really just a number with a specialized representation.


Enter the Strings: ASCIIString and UTF strings

Julia can represent a string literal in multiple different ways, each with a different range of characters it can hold but also with correspondingly increasing memory requirements. When entering a literal, Julia will determine the most economical form to hold the string.

  • If the string literal only contains characters on the ASCII code table, it can be represented by ASCIIString.
  • If the string literal contains any character outside the ASCII code table, the literal will be represented as a UTF8String.

ASCII is a directly indexable string format because every ASCII character has the same size. UTF-8, on the other hand, is not directly indexable, because the UTF-8 encoding of a character may be one, two, three or four bytes long (unlike UTF-32, which always encodes every character as a 32-bit object). The two default types, UTF8String and ASCIIString, form a union type called ByteString.
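The variable width of UTF-8 can be observed with sizeof, which reports the number of bytes a string occupies. The sketch below wraps four sample characters (chosen for illustration) in one-character strings and measures each; escapes are used so that the code points are unambiguous.

```julia
# One sample character per UTF-8 width, written as escapes:
# 'a' (U+0061), 'ë' (U+00EB), '€' (U+20AC), '𝄞' (U+1D11E)
samples = ['a', '\u00eb', '\u20ac', '\U0001d11e']

# Number of bytes each character occupies when stored in a string.
widths = [sizeof(string(c)) for c in samples]

println(widths)  # [1, 2, 3, 4]
```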



Figure 1 Type hierarchy of string types. Dotted outlines refer to “abstract types,” types that are not directly instantiated. As such, for instance, no object would be of the type DirectIndexString, but rather of a type which descends from DirectIndexString. Types that can be instantiated are called concrete types.


As the type hierarchy chart shows, the default ASCII string type – ASCIIString – and a Unicode equivalent, UTF32String, derive from DirectIndexString. DirectIndexString, in turn, derives from AbstractString. Without delving too deeply into the concept of types in Julia, it’s important to spend a moment on explaining what this means.

AbstractString and DirectIndexString are what you might have heard referred to as interfaces. An interface is a concept in object-oriented programming that defines how a number of classes interact with the world. As long as a class implements a particular interface (meaning it conforms to the same specification), a user does not need to know a thing about the particular class to interact with it: they only have to write code that can deal with the interface. AbstractString and DirectIndexString are, as such, not concrete types an object can have. You cannot create an object of type AbstractString, for instance; any string you create will be an object of a concrete type such as, predictably, ASCIIString or UTF8String. Rather, they are what are called supertypes. This means that as long as a particular object’s type implements the AbstractString interface, a function expecting an AbstractString will be just fine, even if the function has not been written with that particular concrete type in mind. In the following, when we discuss strings, we will discuss types that implement AbstractString.
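A function written against the supertype accepts any concrete string type. The sketch below defines a hypothetical function, count_chars, and deliberately avoids naming concrete types, since those differ between Julia versions. SubString is used as an example of a concrete type the function was not specifically written for.

```julia
# Hypothetical function: the signature only requires an AbstractString.
count_chars(s::AbstractString) = length(s)

plain    = "Noel Coward"
accented = "No\u00ebl Coward"           # ë written as an escape
piece    = SubString(accented, 1, 2)    # a different concrete string type

println(count_chars(plain))     # 11
println(count_chars(accented))  # 11
println(count_chars(piece))     # 2
println(isa(piece, AbstractString))  # true
```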

A string is entered with double or triple double quotes:

julia> "I am a string."
"I am a string."

Depending on what it contains, it will be saved by default as the smallest appropriate ByteString format: ASCIIString or, if it contains any non-ASCII UTF-8 characters, a UTF8String.


String indexing and its pitfalls

String indexing works by using square brackets, which can take a single index, a range (denoted by a colon) or an end offset (denoted by end-n, where n is an integer). Those coming to Julia from R will be delighted to learn that Julia is 1-indexed: that is, the lowest index of a string (or, indeed, any indexable collection) is 1, rather than 0, as would be the case with e.g. Python.

julia> s = "I am reading a great book on Julia!"
"I am reading a great book on Julia!"
julia> s[3]
'a'
julia> s[14:end-1]
"a great book on Julia"

Eagle-eyed readers might have spotted that the result of a single indexing is surrounded by single quotation marks, which, as we noted in the previous subsection, indicates a Char datatype. Single indexing of strings returns a Char, but range indexing that ends up returning a single character does not: for instance, s[4:4] returns an ASCIIString of length 1. Now let’s have a look at the first few characters of the name of the iconic English actor who played Captain Kinross in In Which We Serve.

julia> captain_kinross = "Noël Coward"
"Noël Coward"
julia> captain_kinross[3]
'ë'
julia> captain_kinross[4]
LoadError: invalid character index
while loading In[14], in expression starting on line 1

     in next at ./unicode/utf8.jl:69
     in getindex at strings/basic.jl:37

Oh, that’s unexpected. Julia, which we have come to know as rather clever, seems to stumble at the task of finding the letter that comes after the ë. The reason is hinted at in the error message, which mentions utf8.jl. Checking the type of captain_kinross confirms this:

julia> typeof(captain_kinross)
UTF8String

Because of the non-ASCII character in poor Sir Noël’s name, Julia has not been able to store it as an ASCIIString, so it stored it as the next best format – a UTF8String. And indexing seems to have gone haywire – but why?



Figure 2 In a directly indexable string (ASCIIString, above), grapheme and byte boundaries coincide, so Julia’s byte indexing retrieves the right graphemes. In a non-DirectIndexString, where UTF-8 is involved, characters outside the ASCII code table can bring graphemes and bytes out of alignment.


As we have seen above, UTF8String is not a DirectIndexString descendant. This is because UTF-8 encodes each character in as few bytes as possible: one, two, three or four. The byte size (‘width’) of a character is therefore not fixed, and the string cannot be directly indexed. In Julia, an index into a string refers not to the n-th character of that string but to the n-th byte, because locating a character by its ordinal position in a variable-width encoding cannot be done efficiently. Here, ë occupies bytes 3 and 4, so index 4 points into the middle of a character, which is what triggers the error. The consequence is that where a string is kept in a non-DirectIndexString type object, bracket indexing by character position will not be reliable.
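The byte layout can be inspected without triggering an error. The following sketch assumes a reasonably recent Julia session and uses the standard functions sizeof, length, isvalid and nextind; the variable name is illustrative, and the ë is written as an escape so that its code point is unambiguous.

```julia
name = "No\u00ebl Coward"   # the same text as captain_kinross

println(sizeof(name))        # 12 (bytes)
println(length(name))        # 11 (characters)

# Index 3 is the first byte of ë; index 4 falls inside it.
println(isvalid(name, 3))    # true
println(isvalid(name, 4))    # false

# nextind returns the byte index at which the following character starts.
after_e = nextind(name, 3)
println(after_e)             # 5
println(name[after_e])       # l
```

Stepping through a string with nextind, rather than adding 1 to an index, always lands on a valid character boundary.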

An easy workaround to this problem, albeit at the expense of memory efficiency, is to convert the string to a DirectIndexString type, namely UTF32String:

julia> the_indexable_captain_kinross = UTF32String("Noël Coward")
"Noël Coward"
julia> the_indexable_captain_kinross[4]
'l'
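A comparable effect can be had by collecting the characters into an array, which is a different technique from the UTF32String conversion shown above. This is only a sketch under the same memory trade-off: every element of the array is a full 32-bit Char.

```julia
# Collect the characters of the string into a Vector{Char}.
chars = collect("No\u00ebl Coward")   # ë written as an escape

println(chars[4])          # l
println(length(chars))     # 11
println(join(chars[1:4]))  # Noël
```

Because the array is indexed by element rather than by byte, chars[4] is the fourth character regardless of how many bytes the preceding characters occupied in the original string.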

We hope this article has helped you better understand the anatomy of a string in Julia!