Characters and strings
The simplest character-based variables consist of ASCII and Unicode characters.
A single character is delimited by single quotes, whereas a string uses double quotes or, in some cases, triple double quotes ("""), which are discussed later in this section.
A string can be viewed as a one-dimensional array of characters and can be indexed and manipulated in much the same way as an array of numeric values:
julia> s = "Hi there, Blue Eyes!"
"Hi there, Blue Eyes!"

julia> length(s)
20

julia> s[11]
'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)

julia> s[end]
'!': ASCII/Unicode U+0021 (category Po: Punctuation, other)
Hint: try evaluating the following list comprehension: [s[i] for i = length(s):-1:1].
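As a minimal sketch of the array-like view described above, the following collects the characters of s into a vector and evaluates the comprehension from the hint. It assumes an ASCII-only string, where length(s) is also the last valid index.

```julia
s = "Hi there, Blue Eyes!"

# Collect the characters into a Vector{Char}
chars = collect(s)
println(typeof(chars))

# The comprehension from the hint walks the indices from last to first
backwards = [s[i] for i = length(s):-1:1]

# Joining the reversed characters gives the same string as reverse(s)
println(join(backwards))
println(join(backwards) == reverse(s))
```

For an ASCII string such as this one, the comprehension and the built-in reverse() function agree.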
Characters
Observe that Julia has a built-in Char type to represent a character.
A character occupies 32 bits, not 8, which is why it can hold a Unicode character. Have a look at the following example:
# All the following represent the ASCII character capital-A
julia> c = 'A';

julia> c = Char(65);

julia> c = '\U0041'
'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
Julia supports Unicode code points, as we see here:
julia> c = '\Uc041'
'': Unicode U+c041 (category Lo: Letter, other)
As such, we can output characters from a variety of different alphabets—for example, Chinese:
julia> '\U7537'
'男': Unicode U+7537 (category Lo: Letter, other)
It is possible to specify a character code such as '\Uffff', but Char conversion does not check that every value is valid. However, Julia provides an isvalid() function that can be applied to characters:
julia> c = '\Udff3'; isvalid(c)
false
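The conversions between characters and their integer code points can be sketched as follows. The values chosen here are only illustrative; isvalid(Char, x) is the form of isvalid() that tests an integer code point before it is converted.

```julia
c = 'A'

# Converting a Char to an integer gives its code point, and back again
println(Int(c))
println(Char(Int(c) + 1))

# A Char value occupies 4 bytes (32 bits)
println(sizeof(c))

# Validity can be checked on a Char or on a candidate code point
println(isvalid(c))
println(isvalid(Char, 0xdff3))   # a surrogate code point, not a valid character
```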
Julia uses the special C-like syntax for certain ASCII control characters such as '\b', '\t', '\n', '\r', and '\f' for backspace, tab, newline, carriage return, and form feed, respectively.
The backslash acts as an escape character, so Int('t') => 116, whereas Int('\t') => 9.
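To see the effect of the escape character, the short sketch below prints the integer code of each control character named above, alongside the plain letter t for comparison. The labels are ours and serve only as output headings.

```julia
# Pairs of (label, character); the labels are for display only
controls = [("backspace", '\b'), ("tab", '\t'), ("newline", '\n'),
            ("carriage return", '\r'), ("form feed", '\f')]

for (name, ch) in controls
    println(rpad(name, 16), Int(ch))
end

# Without the backslash, 't' is simply the lowercase letter
println(rpad("letter t", 16), Int('t'))
```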
If more than one character is supplied between the single quotes, this raises an error:
julia> 'Hello'
ERROR: syntax: character literal contains multiple characters
Strings
The type of string we are most familiar with comprises a list of ASCII characters that, as we have observed, are normally delimited with double quotes, as in the following example:
julia> s = "Hello there, Blue Eyes";

julia> typeof(s)
String
The following points are worth noting:
- The built-in concrete type used for strings (and string literals) is String.
- This supports the full range of Unicode characters via UTF-8 encoding.
- All string types are subtypes of the AbstractString abstract type, so when defining a function expecting a string argument, you should declare the type as AbstractString in order to accept any string type.
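The last point can be illustrated with a small sketch. The count_words function is a hypothetical helper written for this example; it is not part of Julia. Because its argument is declared as AbstractString, it accepts both a String and a SubString.

```julia
# Hypothetical helper: counts whitespace-separated words in any kind of string
count_words(text::AbstractString) = length(split(text))

s = "Hello there, Blue Eyes"
part = SubString(s, 14, 22)   # a view into s, not a String

println(typeof(part))
println(count_words(s))
println(count_words(part))
```

Had the argument been declared as String, the final call would fail with a MethodError, since a SubString is not a String.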
A transcode() function can be used to convert to/from other Unicode encodings:
julia> s = "αβγ";

julia> transcode(UInt16, s)
3-element Vector{UInt16}:
 0x03b1
 0x03b2
 0x03b3
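A round trip through UTF-16 might look like the following sketch, which also compares the number of code units in each encoding for this particular string.

```julia
s = "αβγ"

utf8  = codeunits(s)            # the UTF-8 bytes that Julia stores
utf16 = transcode(UInt16, s)    # the same text as UTF-16 code units

# Each of these Greek letters needs two bytes in UTF-8 but one UTF-16 unit
println(length(utf8))
println(length(utf16))

# Converting back recovers the original string
back = transcode(String, utf16)
println(back == s)
```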
In Julia (as in Java), strings are immutable; that is, the value of a String object cannot be changed. To construct a different string value, you construct a new string from parts of other strings. Let's look at this in more detail:
- ASCII strings are indexable, so from s as defined previously: s[14:17] # => "Blue".
- The values in the range are inclusive, and if we wish, we can change the increment to s[14:2:17] => "Bu" or reverse the slice to s[17:-1:14] => "eulB".
- The end of the range cannot simply be omitted (s[14:] is a syntax error); use the end keyword to run to the end of the string: s[14:end] => "Blue Eyes".
- However, s[:14] is somewhat unexpected and gives the character 'B', not the string up to and including B. This is because ':' defines a "symbol", and for a literal, :14 is equivalent to 14, so s[:14] is the same as s[14] and not s[1:14].
- The final character in a string can be indexed using the notation end, so in this case, s[end] is equal to the 's' character.
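The slicing rules above, together with immutability, can be sketched as follows. The replacement word "Brown" is our own choice for illustration; the attempt to assign to s[1] is wrapped in try/catch so that the script runs to completion.

```julia
s = "Hello there, Blue Eyes"

println(s[14:17])       # an inclusive range
println(s[14:2:17])     # every second character in the range
println(s[17:-1:14])    # the same slice, reversed
println(s[14:end])      # from index 14 to the end of the string

# Strings are immutable, so assigning to an index is not supported
try
    s[1] = 'J'
catch err
    println(typeof(err))
end

# Instead, build a new string from parts of the old one
t = s[1:13] * "Brown" * s[18:end]
println(t)
println(s)              # the original is unchanged
```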
Strings allow for special characters such as \n, \t, and so on.
If we wish to include the double quotes, we can escape them, but Julia also provides a """ delimiter.
So, s = "This is the double quote \" character" and s = """This is the double quote " character""" are equivalent:
julia> s = "This is a double quote \" character."; println(s);
This is a double quote " character.
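The equivalence of the two forms can be checked directly, as in this brief sketch:

```julia
# The same text written with an escaped quote and with the """ delimiter
a = "This is the double quote \" character"
b = """This is the double quote " character"""

println(a == b)
println(b)
```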
Strings also provide the "$" convention for interpolating the value of a variable:
julia> age = 21; s = "I've been $age for many years now!"
"I've been 21 for many years now!"
Concatenation of strings can be done using the $ convention, but Julia also uses the '*' operator (rather than '+' or some other symbol):
julia> s = "Who are you?";

julia> t = " said the Caterpillar.";

julia> s*t   # or, equivalently, "$s$t"
"Who are you? said the Caterpillar."
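The following sketch compares the ways of joining strings mentioned above, and adds the string() function as a third option. It also shows that a whole expression can be interpolated by wrapping it in parentheses after the $ sign.

```julia
age = 21
s = "Who are you?"
t = " said the Caterpillar."

joined = s * t

# Interpolation and the string() function build the same value
println(joined == "$s$t")
println(joined == string(s, t))

# Parentheses allow an expression, not just a variable, to be interpolated
println("Next year I will be $(age + 1).")
```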
Note
Here’s how a Unicode string can be formed by concatenating a series of characters:
julia> '\U7537'*'\U4EBA'
"男人"
Regular expressions
Regular expressions (regexes) came to prominence with their inclusion in the Perl programming language.
There is an old Perl programmer’s adage: “I had a problem and decided to solve it using regular expressions; now, I have two problems.”
Regexes are used for pattern matching; numerous books have been written on them, and support is available in a variety of programming languages post-Perl, notably Java and Python. Julia supports regexes via a special form of string prefixed with r.
Suppose we define an empat pattern as follows:
julia> empat = r"^\S+@\S+\.\S+$"

julia> typeof(empat)
Regex
The following example will give a clue to what the pattern is associated with:
julia> occursin(empat, "[email protected]")
true

julia> occursin(empat, "Fredrick [email protected]")
false
The pattern describes a simple email address. In the second case, the space in Fredrick Flintstone cannot be matched by \S+ (one or more non-whitespace characters), so the match fails.
Since we may wish to know not only whether a string matches a certain pattern but also how it is matched, Julia has a match() function:
julia> m = match(r"@bedrock","barney,[email protected]")
RegexMatch("@bedrock")
If the pattern matches, the function returns a RegexMatch object; otherwise, it returns nothing. The fields of the RegexMatch object describe the match:
julia> m.match
"@bedrock"

julia> m.offset
14

julia> m.captures
0-element Array{Union{Nothing,SubString{String}},1}
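The captures field is empty above because the pattern contains no groups. As a sketch, adding parentheses to the email pattern fills it in. The address used here is a made-up value for illustration only and is not taken from the examples above.

```julia
# Hypothetical address, used only for this illustration
addr = "wilma@bedrock.example"

# Parentheses create capture groups for the user, host, and suffix
m = match(r"^(\S+)@(\S+)\.(\S+)$", addr)

if m !== nothing
    user, host, suffix = m.captures
    println(user)
    println(host)
    println(suffix)
    println(m.offset)     # index at which the match begins
end

# When the pattern does not match, match() returns nothing
println(match(r"@bedrock", "no at-sign here") === nothing)
```

Testing the result against nothing before using it avoids an error when the pattern fails to match.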
A detailed discussion of regexes is beyond the scope of this book.
The following link provides a good online source for all things regex, including an excellent cheat sheet via the Quick Reference page: https://www.rexegg.com.
In addition, there are a number of books on the subject, and a free PDF can be downloaded from the following link:
https://www.academia.edu/22080976/Regular_expressions_cookbook_2nd_edition.
Version strings
Version numbers can be expressed with non-standard string literals of the form v"...".
These literals create VersionNumber objects that follow the specifications of "semantic versioning" and are therefore composed of major, minor, and patch numeric values, followed by pre-release and build alphanumeric annotations.
So, a full specification would typically be v"1.9.1-rc1", where the major version is 1, the minor version is 9, the patch level is 1, and the pre-release annotation is rc1 (release candidate 1).
Currently, only the major version needs to be provided, and the others will assume default values; for example, v"1" is equivalent to v"1.0.0".
(The release candidate has no default, so needs to be explicitly defined.)
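The components of a VersionNumber are available as fields and version numbers can be compared, as in the sketch below. VERSION is the built-in constant holding the version of the running Julia session; the comparison against v"1.0" is only an example of a typical check.

```julia
v = v"1.9.1-rc1"

# The numeric components and the pre-release annotation
println(v.major)
println(v.minor)
println(v.patch)
println(v.prerelease)

# Omitted components take default values
println(v"1" == v"1.0.0")

# A release candidate sorts before the corresponding final release
println(v < v"1.9.1")

# A common use is to test the running version of Julia
println(VERSION >= v"1.0")
```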
Byte array literals
Another special form is the b"..." byte array literal, which permits string notation to express arrays of UInt8 values.
These are the rules for byte array literals:
- ASCII characters and ASCII escape sequences produce a single byte
- \x and octal escape sequences produce a byte corresponding to the escape value
- Unicode escape sequences produce a sequence of bytes encoding that code point in UTF-8
Consider the following two examples:
julia> A = b"HEX:\xefcc"
7-element Base.CodeUnits{UInt8,String}:
 [0x48,0x45,0x58,0x3a,0xef,0x63,0x63]

julia> B = b"\u2200 x \u2203 y"
11-element Base.CodeUnits{UInt8,String}:
 0xe2 0x88 0x80 0x20 0x78 0x20 0xe2 0x88 0x83 0x20 0x79
Here, the first three elements represent the \u2200 code, then 0x20, 0x78, 0x20 correspond to <space>x<space>, followed by three more elements for the \u2203 code, and finally, 0x20, 0x79, which represent <space>y.
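As a sketch of how the byte array relates to the string it encodes, the following compares B with the code units of the equivalent string literal and converts the bytes back into a String.

```julia
B = b"\u2200 x \u2203 y"

# The literal holds the UTF-8 bytes of the corresponding string
println(length(B))
println(B == codeunits("∀ x ∃ y"))

# Individual elements are UInt8 values
println(B[1])
println(typeof(B[1]))

# The bytes can be turned back into a String
println(String(B))
```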