Introducing regular expressions
A regular expression, or regex, is a pattern used to match text. In other words, it allows us to define an abstract string (typically, the definition of a structured kind of text) and check other strings against it to see whether they match.
They are easier to describe with an example. Consider a pattern defined as a word that starts with an uppercase A and contains only lowercase "n"s and "a"s after that. Here are some possible comparisons and their results:
Text to compare | Result |
Anna | Match |
Bob | No match (No initial A) |
Alice | No match (l is not n or a after initial A) |
James | No match (No initial A) |
Aaan | Match |
Ana | Match |
Annnn | Match |
Aaaan | Match |
ANNA | No match (N is not n or a) |
Table 1.1: A pattern matching example
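As a rough sketch, the pattern behind Table 1.1 could be written as A[na]+ and checked against each word with re.fullmatch. The exact pattern is an assumption (the text doesn't spell it out, and this version requires at least one letter after the A), and the names PATTERN, WORDS, and results are ours. The square brackets and the + sign are explained later in this recipe.

```python
import re

# Assumed pattern: an uppercase A followed by one or more lowercase n or a
PATTERN = r'A[na]+'
WORDS = ['Anna', 'Bob', 'Alice', 'James', 'Aaan', 'Ana', 'Annnn', 'Aaaan', 'ANNA']

# re.fullmatch only succeeds if the whole word fits the pattern
results = {word: bool(re.fullmatch(PATTERN, word)) for word in WORDS}

for word, matched in results.items():
    print(word, 'Match' if matched else 'No match')
```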
If this sounds complicated, that's because it is. Regexes are notorious for being intricate and difficult to follow. But they are also very useful, because they allow us to perform incredibly powerful pattern matching.
Some common uses of regexes are:
- Validating input data: For example, a phone number that is only numbers, dashes, and brackets.
- String parsing: Retrieve data from structured strings, such as logs or URLs. This is similar to what's described in the previous recipe.
- Scraping: Find the occurrences of something in a long piece of text. For example, find all of the emails in a web page.
- Replacement: Find and replace a word or words with others. For example, replace the owner with John Smith.
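The four uses above can be sketched with functions from the re module. This is only an illustration: the sample strings, the addresses, and the looks_like_phone helper are made up, and the patterns are deliberately simple rather than production-ready.

```python
import re

# Validation: the whole input must be digits, dashes, and parentheses
def looks_like_phone(text):
    return re.fullmatch(r'[0-9()-]+', text) is not None

# Parsing: pull the trailing status code out of a made-up log line
log_line = 'GET /index.html 200'
status = re.search(r'[0-9]+$', log_line).group()

# Scraping: find every naive email-like token in a piece of text
page = 'Contact jane@example.com or joe@example.org for details'
emails = re.findall(r'\S+@\S+', page)

# Replacement: swap one phrase for another
replaced = re.sub(r'the owner', 'John Smith', 'Please contact the owner today')

print(looks_like_phone('(555)-1234'), status, emails, replaced)
```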
Getting ready
The Python module for dealing with regexes is called re. The main function we'll cover is re.search(), which returns a match object with information about what matched the pattern.
As regex patterns are also defined as strings, we'll differentiate them by prefixing them with an r, such as r'pattern'. This is the Python way of marking a string as a raw string literal, meaning that the string within is taken literally, without any escaping. This means that a "\" is treated as a backslash instead of the start of an escape sequence. For example, without the r prefix, \n means a newline character.
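A quick way to see the difference is to compare the lengths of a regular and a raw string. The variable names below are only for illustration.

```python
# A regular string interprets \n as a single newline character
normal = '\n'
# A raw string keeps the backslash and the n as two separate characters
raw = r'\n'

print(len(normal), len(raw))  # 1 2

# The raw form is equivalent to escaping the backslash by hand
same_as_escaped = (raw == '\\n')

# This matters for regexes: '\b' is a backspace character in a regular
# string, while r'\b' reaches the re module as backslash + b
backspace = '\b'
word_boundary = r'\b'
print(len(backspace), len(word_boundary))  # 1 2
```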
Some characters are special and refer to concepts such as the end of the string, any digit, any character, any whitespace character, and so on.
The simplest form is just a literal string. For example, the regex pattern r'LOG' matches the string 'LOGS', but not the string 'NOT A MATCH'. If there's no match, re.search returns None. If there is, it returns a special Match object:
>>> import re
>>> re.search(r'LOG', 'LOGS')
<_sre.SRE_Match object; span=(0, 3), match='LOG'>
>>> re.search(r'LOG', 'NOT A MATCH')
>>>
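Because re.search returns None when nothing matches, scripts usually check the result before using it. A minimal sketch, where find_span is a hypothetical helper rather than part of the re module:

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

print(find_span(r'LOG', 'LOGS'))         # (0, 3)
print(find_span(r'LOG', 'NOT A MATCH'))  # None
```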
How to do it…
- Import the re module:
  >>> import re
- Then, match a pattern that is not at the start of the string:
  >>> re.search(r'LOG', 'SOME LOGS')
  <_sre.SRE_Match object; span=(5, 8), match='LOG'>
- Match a pattern that is only at the start of the string. Note the ^ character:
  >>> re.search(r'^LOG', 'LOGS')
  <_sre.SRE_Match object; span=(0, 3), match='LOG'>
  >>> re.search(r'^LOG', 'SOME LOGS')
  >>>
- Match a pattern only at the end of the string. Note the $ character:
  >>> re.search(r'LOG$', 'SOME LOG')
  <_sre.SRE_Match object; span=(5, 8), match='LOG'>
  >>> re.search(r'LOG$', 'SOME LOGS')
  >>>
- Match the word 'thing' (not excluding things), but not something or anything. Note the \b at the start of the second pattern:
  >>> STRING = 'something in the things she shows me'
  >>> match = re.search(r'thing', STRING)
  >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
  ('some', 'thing', ' in the things she shows me')
  >>> match = re.search(r'\bthing', STRING)
  >>> STRING[:match.start()], STRING[match.start():match.end()], STRING[match.end():]
  ('something in the ', 'thing', 's she shows me')
- Match a pattern that's only numbers and dashes (for example, a phone number). Retrieve the matched string:
  >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
  <_sre.SRE_Match object; span=(20, 32), match='1234-567-890'>
  >>> re.search(r'[0123456789-]+', 'the phone number is 1234-567-890').group()
  '1234-567-890'
- Match an email address naively:
  >>> re.search(r'\S+@\S+', 'my email is [email protected]').group()
  '[email protected]'
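To try the last step with a concrete value, here is a minimal sketch using a placeholder address. The address jane@example.com and the names TEXT and email are made up for illustration.

```python
import re

TEXT = 'my email is jane@example.com'  # placeholder address, not from the recipe

match = re.search(r'\S+@\S+', TEXT)
# Guard against None so a missing match doesn't raise AttributeError
email = match.group() if match else None

print(email)  # jane@example.com
```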
How it works…
The re.search function matches a pattern, no matter its position in the string. As explained previously, it returns None if the pattern is not found, or a Match object otherwise.
The following special characters are used:
- ^: Marks the start of the string
- $: Marks the end of the string
- \b: Marks the start or end of a word
- \S: Marks any character that's not a whitespace, including characters like * or $
More special characters are shown in the next recipe, Going deeper into regular expressions.
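The four special characters listed above can be tried side by side on one sample string. The string and variable names here are made up for the sketch.

```python
import re

TEXT = 'LOG: something happened'

# ^ anchors the pattern to the start of the string
starts_with_log = re.search(r'^LOG', TEXT) is not None   # True
# $ anchors the pattern to the end of the string
ends_with_log = re.search(r'LOG$', TEXT) is not None     # False
# \b requires 'thing' to begin a word, so 'something' does not qualify
word_thing = re.search(r'\bthing', TEXT) is not None     # False
# \S+ grabs the first run of non-whitespace characters, colon included
first_token = re.search(r'\S+', TEXT).group()            # 'LOG:'

print(starts_with_log, ends_with_log, word_thing, first_token)
```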
In step 6 of the How to do it section, the r'[0123456789-]+' pattern is composed of two parts. The first one is between square brackets, and matches any single character that is either a digit from 0 to 9 or the dash (-) character. The + sign after that means that this character can be present one or more times. This is called a quantifier in regexes. Together, the two parts match any combination of digits and dashes, no matter how long it is.
Step 7 again uses the + sign to match as many characters as necessary before the @ and again after it. In this case, the character match is \S, which matches any non-whitespace character.
Please note that the pattern for emails described here is very naive, as it will match invalid emails such as john@[email protected]. A better regex for most uses is r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)". You can go to http://emailregex.com/ to find it, along with links to more information.
Note that validating an email address, including all of the corner cases, is actually a difficult problem. The previous regex should be fine for most uses covered in this book, but in a general framework project such as Django, email validation relies on a very long and hard-to-read regex.
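The difference between the two email patterns can be checked with a short sketch. The addresses and the is_match helper are hypothetical, and the stricter pattern is still only an approximation of what counts as a valid address.

```python
import re

NAIVE = r'\S+@\S+'
STRICTER = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

def is_match(pattern, address):
    """Report whether the pattern is found in the address (hypothetical helper)."""
    return re.search(pattern, address) is not None

good = 'jane@example.com'      # made-up, well-formed address
bad = 'jane@doe@example.com'   # made-up, invalid: two @ signs

print(is_match(NAIVE, good), is_match(NAIVE, bad))        # True True
print(is_match(STRICTER, good), is_match(STRICTER, bad))  # True False
```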
The resulting match object reports the positions where the matched pattern starts and ends (using the start and end methods), as shown in step 5, which splits the string into the parts before, inside, and after the match, showing the distinction between the two patterns.
The difference displayed in step 5 is a very common one. Trying to capture GP (as in General Practitioner, for a medical doctor) in a case-insensitive search can end up capturing eggplant and bagpipe! Similarly, thing\b won't capture things. Be sure to test and make the proper adjustments, such as using \bGP\b for just the word GP.
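These boundary effects are easy to verify. The sketch below assumes a case-insensitive search (re.IGNORECASE), since a case-sensitive search for GP would not find the lowercase letters inside eggplant; the word list and variable names are ours.

```python
import re

WORDS = ['GP', 'eggplant', 'bagpipe']

# A case-insensitive search for the bare letters also hits unrelated words
loose = [w for w in WORDS if re.search(r'gp', w, re.IGNORECASE)]
# Word boundaries on both sides keep only the standalone word
strict = [w for w in WORDS if re.search(r'\bGP\b', w, re.IGNORECASE)]

# thing\b needs a word boundary right after 'thing', so 'things' is skipped
thing_in_things = re.search(r'thing\b', 'things') is not None
thing_alone = re.search(r'thing\b', 'one thing') is not None

print(loose, strict, thing_in_things, thing_alone)
```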
The specific matched pattern can be retrieved by calling group(), as shown in step 6. Note that the result will always be a string. It can be further processed using any of the methods that we've previously seen, such as by splitting the phone number into groups by dashes, for example:
>>> match = re.search(r'[0123456789-]+', 'the phone number is 1234-567-890')
>>> [int(n) for n in match.group().split('-')]
[1234, 567, 890]
There's more…
Dealing with regexes can be difficult and complex. Please allow time to test your matches and be sure that they work as you expect in order to avoid nasty surprises.
"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
– Jamie Zawinski
Regular expressions are at their best when they are kept very simple. In general, if there is a specific tool to do it, prefer it over regexes. A very clear example of this is with HTML parsing; refer to Chapter 3, Building Your First Web Scraping Application, for better tools to achieve this.
Some text editors allow us to search using regexes as well. While most are editors aimed at writing code, such as Vim, BBEdit, or Notepad++, they're also present in more general tools, such as MS Office, Open Office, or Google Documents. But be careful, as the particular syntax may be slightly different.
You can check your regexes interactively with some tools. A good one that's freely available online is https://regex101.com/, which displays each of the elements and explains the regex. Double-check that you're using the Python flavor:
Figure 1.1: An example using RegEx101
Note that the EXPLANATION box in the preceding image describes that \b matches a word boundary (the start or end of a word), and that thing matches literally these characters.
Regexes, in some cases, can be very slow, or even susceptible to what's called a regex denial-of-service (ReDoS) attack: a string crafted to make a particular regex take an enormous amount of time to process. In the worst-case scenario, it can even hang the computer. While automating tasks probably won't get you into those problems, keep an eye out in case a regex takes too long to process.
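To get a feel for the problem, the sketch below times a pattern with nested quantifiers, a well-known source of catastrophic backtracking, against inputs of growing length. The pattern, the time_search helper, and the chosen lengths are illustrative assumptions; the lengths are kept small on purpose, as the time tends to grow very quickly with each extra character.

```python
import re
import time

# Nested quantifiers such as (a+)+ can trigger catastrophic backtracking
SLOW_PATTERN = r'(a+)+$'

def time_search(length):
    """Time a failing search on 'aaa...ab'; the trailing b forces backtracking."""
    text = 'a' * length + 'b'
    start = time.perf_counter()
    result = re.search(SLOW_PATTERN, text)
    return result, time.perf_counter() - start

for length in (10, 14, 18):
    result, elapsed = time_search(length)
    # result is None each time; only the elapsed time changes
    print(length, result, f'{elapsed:.4f}s')
```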
See also
- The Extracting data from structured strings recipe, covered earlier in the chapter, to learn simple techniques to extract information from text.
- The Using a third-party tool—parse recipe, covered earlier in the chapter, to use a third-party tool to extract information from text.
- The Going deeper into regular expressions recipe, covered later in the chapter, to further your knowledge of regular expressions.