Chapter 17 Using Regular Expressions

A regular expression, often called regex, is a way of describing patterns in strings of characters. What is such a pattern? A few examples:

Telephone numbers. Although called “numbers,” they are often written in a nonstandard numerical format, e.g.,

(651) 696-6000
+1 651.696.6000
Times of day are written according to a pattern, e.g.

10:30 AM
9 o'clock
13:15.54

Durations are also written in a similar way. Eliud Kipchoge was the first person to run a marathon (26.2 miles) in less than two hours, doing so in 1:59:40 in October 2019. (Note: Kipchoge’s October 2019 accomplishment was disqualified as an official world record because it was done under ‘artificial’ conditions, although it’s worth noting that Kipchoge was already the official world record holder at the time, having completed the 2018 Berlin Marathon in a time of 2:01:39.)
Postal codes. In the US, postal codes are written in several formats:

16801 (five digit format)
55105-1362 (ZIP + 4; 9 digit format)
Leading and trailing spaces. Sometimes the content of a string is preceded or followed by blank spaces, like this:

" henry "

Regular expressions are used for several purposes:

Use filter() and grepl() to detect whether a pattern is contained in a string.
Use mutate() and gsub() to replace the elements of that pattern with something else.
Use str_extract() from stringr (loaded with the tidyverse package) to extract the component that matches the pattern.
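The three purposes can be tried on a small character vector before turning to a full data table. The following is a minimal sketch: the sample strings are made up for illustration, and base R's regmatches() and regexpr() stand in for str_extract() so that the code runs without loading any packages.

```r
# Hypothetical sample strings (not taken from BabyNames)
x <- c("sunshine", "Moonshine", "rain")

# Detect: does each string contain the pattern?
has_shine <- grepl("shine", x)

# Replace: swap the matched text for something else
replaced <- gsub("shine", "light", x)

# Extract: pull out the first run of lower-case vowels in each string
# (base R stand-in for stringr::str_extract())
first_vowels <- regmatches(x, regexpr("[aeiou]+", x))

print(has_shine)     # TRUE TRUE FALSE
print(replaced)      # "sunlight" "Moonlight" "rain"
print(first_vowels)  # "u" "oo" "ai"
```

Note that regmatches() silently drops strings with no match, whereas str_extract() returns NA for them, so the two are not interchangeable in every situation.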

To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.

NameList <- 
  BabyNames %>% 
  group_by(name, sex) %>%
  summarise(total = sum(count)) %>%
  arrange(desc(total))

Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.

The name contains “shine”, as in “sunshine” or “moonshine” or “Shinelle”

NameList %>% 
  filter(grepl("shine", name, ignore.case = TRUE))

name	sex	total
Rashine	M	11
Shine	F	95
… and so on for 16 rows altogether.

The name contains three or more vowels in a row.

NameList %>% 
  filter(grepl("[aeiou]{3,}", name, ignore.case = TRUE))

name	sex	total
Louis	M	389910
Louise	F	331551
Isaiah	M	177412
… and so on for 1,933 rows altogether.

The name contains three or more consonants in a row.

NameList %>% 
  filter(grepl("[^aeiou]{3,}", name, ignore.case = TRUE) )

name	sex	total
Christopher	M	1984307
Matthew	M	1540182
Anthony	M	1391462
… and so on for 16,767 rows altogether.

The name contains “mn”

NameList %>% 
  filter(grepl("mn", name, ignore.case = TRUE))

name	sex	total
Autumn	F	104408
Sumner	M	2287
Amna	F	1099
… and so on for 51 rows altogether.

The first, third, and fifth letters are consonants.

NameList %>% 
  filter(grepl("^[^aeiou].[^aeiou].[^aeiou]", name,
               ignore.case = TRUE))

name	sex	total
James	M	5091189
Robert	M	4789776
David	M	3565229
… and so on for 28,747 rows altogether.

17.1 RegEx basics

A regular expression (RegEx) may be simple or complicated. All of them look foreign until you learn how to read them.

Regexes encode patterns. The simplest patterns can be taken literally. For instance, "able" will match any string with “able” in it, e.g., “You are able …”, “She is capable”, “The motion is tabled”. But "able" will not match “Able to do what?” Capital letters are distinct from lower case, although this can be turned off in many functions with the argument ignore.case = TRUE, as used in the previous examples based on BabyNames.
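As a quick check of the literal pattern and the effect of ignore.case, here is a short sketch using the four phrases from the paragraph above. The variable names are chosen only for this illustration.

```r
# The four example phrases from the text
phrases <- c("You are able", "She is capable",
             "The motion is tabled", "Able to do what?")

# Case-sensitive match: "Able" with a capital A is not matched
case_sensitive <- grepl("able", phrases)

# Case-insensitive match: all four phrases now match
case_insensitive <- grepl("able", phrases, ignore.case = TRUE)

print(case_sensitive)    # TRUE TRUE TRUE FALSE
print(case_insensitive)  # TRUE TRUE TRUE TRUE
```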

More complicated patterns involve punctuation characters that have a special meaning: ., ^, $, [, |, ], {, }. The characters +, ?, *, (, ), and \ are also special; they appear later in this section.

For example, ".ble" is a regex that matches any sequence of four characters ending with “ble”. This will match all the strings matched by "able" but others as well: “sugar is soluble in water”, “money is fungible”, “Bible reading for today”.

Simple patterns:

A single . means “any character.”
A character, e.g., b, means just that character.
Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
The ^ inside square brackets means “any except these.” So, a consonant is [^aeiou]

Alternatives: A vertical bar means “either.” For instance, the regex "rain|snow|sleet|hail" will match any of those four forms of precipitation.

Two simple patterns in a row mean those patterns consecutively. Example: "M[aeiou]" means a capital M followed by a single lower-case vowel.
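The simple patterns, alternatives, and consecutive patterns described above can be compared side by side. This is a small sketch on a made-up vector of words; only the regexes themselves come from the text.

```r
# Hypothetical test words
words <- c("Mable", "Bible", "rain", "Mr.", "sleet")

# "." is any character, so ".ble" matches "able" in Mable and "ible" in Bible
any_ble <- grepl(".ble", words)

# "|" separates alternatives
precip <- grepl("rain|snow|sleet|hail", words)

# Two patterns in a row: capital M, then one lower-case vowel
m_vowel <- grepl("M[aeiou]", words)

# "^" inside brackets negates the set: M followed by anything except a vowel
m_nonvowel <- grepl("M[^aeiou]", words)

print(any_ble)     # TRUE TRUE FALSE FALSE FALSE
print(precip)      # FALSE FALSE TRUE FALSE TRUE
print(m_vowel)     # TRUE FALSE FALSE FALSE FALSE
print(m_nonvowel)  # FALSE FALSE FALSE TRUE FALSE
```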

Repeats

A simple pattern followed by a + means “one or more times.”

So, "M[aeiou]+" will match “Miocene” and “Miss”, but not “Mr.”
A simple pattern followed by a ? means “zero or one times.”

So, "M[aeiou]?[^aeiou]" will not match “Miocene”. (The regex means: M followed by at most one vowel, followed by a consonant. “Miocene” has two vowels following the M.)
A simple pattern followed by a * means “zero or more times.”

So, "M[aeiou]*" matches “Miocene”, “Mr.”, “Ms”, “Mrs.”, “Miss”, and “Man”, among many others.
A simple pattern followed by {2} means “exactly two times.”

Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row” and will not match “Miss” or “Mr” or “Madam”.
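Because the repeat operators are easy to confuse, it helps to run each one against the same few strings. The sketch below uses four of the example words from this section; regmatches() is used only to show which characters each pattern actually consumed.

```r
words <- c("Miocene", "Miss", "Mr.", "Madam")

# "+" : one or more vowels after M, so "Mr." fails
plus <- grepl("M[aeiou]+", words)

# "*" : zero or more vowels after M, so every word matches
star <- grepl("M[aeiou]*", words)

# "?" : at most one vowel, then a non-vowel; "Miocene" fails
opt <- grepl("M[aeiou]?[^aeiou]", words)

# "{2}" : two vowels in a row; only "Miocene" has them
two <- grepl("[aeiou]{2}", words)

# What did the "*" pattern actually match in each word?
star_text <- regmatches(words, regexpr("M[aeiou]*", words))

print(plus)       # TRUE TRUE FALSE TRUE
print(star)       # TRUE TRUE TRUE TRUE
print(opt)        # FALSE TRUE TRUE TRUE
print(two)        # TRUE FALSE FALSE FALSE
print(star_text)  # "Mio" "Mi" "M" "Ma"
```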

Start and end of strings:

^ at the beginning of a regular expression means “the start of the string”
$ at the end means “the end of the string.”
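The anchors can be seen at work in a short sketch. The strings are hypothetical, modeled on the " henry " example from the start of the chapter.

```r
# Hypothetical strings; the second has leading and trailing spaces
x <- c("henry", "  henry  ", "henrietta")

# "^" anchors the pattern to the start of the string
starts_hen <- grepl("^hen", x)

# "$" anchors the pattern to the end of the string
ends_ry <- grepl("ry$", x)

# Using both anchors requires the whole string to be exactly "henry"
exactly_henry <- grepl("^henry$", x)

print(starts_hen)     # TRUE FALSE TRUE
print(ends_ry)        # TRUE FALSE FALSE
print(exactly_henry)  # TRUE FALSE FALSE
```

The padded string fails all three tests because its first and last characters are spaces.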

Extraction:

Some functions that process regexes will extract the specific character sequence that matches the specified pattern. To specify which part of the match is to be extracted, put it inside open and close parentheses. For example, the regex "([aeiouAEIOU]{4})" in the following statement will match all names with 4 consecutive vowels of either upper or lower case and extract those vowels as a new variable.

BabyNames %>%
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  mutate(match = str_extract(string = name, pattern = "([aeiouAEIOU]{4})")) %>%
  filter( !is.na(match)) %>%
  arrange(desc(total))

Table 17.1: Baby names with 4 consecutive vowels captured by the regex "([aeiouAEIOU]{4})"

name	total	match
Louie	29293	ouie
Sequoia	3275	uoia
Gioia	480	ioia
Keiaira	81	eiai
Zoiee	77	oiee
Daaiel	76	aaie
… and so on for 39 rows altogether.

On occasion, you will want to use one of the regex control characters in a literal sense, for example . to mean “period” and not “any one character.” You can do this by “escaping” that character, that is, by preceding it with a double backslash, such as \\. in this case. For instance, in the regex "^\\$|€|¥|£|¢$" about currency symbols, the first $ is intended to refer to the dollar currency sign. But on its own in a regex, $ refers to the end of a character string. We therefore escape the first dollar sign, writing \\$, so that it is taken literally as a currency symbol. To create a regex involving the period found at the end of sentences, you need to escape the ., like this: "\\.$"
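The difference between an escaped and an unescaped period is easy to see on a few test strings. This is a minimal sketch with made-up strings; the last line uses the fixed = TRUE argument of gsub(), which treats the whole pattern as literal text and is an alternative to escaping that the chapter does not otherwise use.

```r
# Hypothetical test strings
s <- c("The end.", "The end!", "3.14")

# Unescaped: "." is any character, so every non-empty string matches
any_last <- grepl(".$", s)

# Escaped: only strings that really end in a period match
period_last <- grepl("\\.$", s)

# The same distinction matters when replacing
wiped   <- gsub(".", "", "3.14")                # removes every character
no_dot  <- gsub("\\.", "", "3.14")              # removes only the period
no_dot2 <- gsub(".", "", "3.14", fixed = TRUE)  # literal matching, no escape

print(any_last)     # TRUE TRUE TRUE
print(period_last)  # TRUE FALSE FALSE
print(wiped)        # ""
print(no_dot)       # "314"
print(no_dot2)      # "314"
```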

17.2 Example tasks with regular expressions

Get rid of percent signs and commas in numerals

Numbers often come with comma separators or unit symbols such as % or $. For instance, Table 17.2 shows part of a table about public debt from Wikipedia.

Table 17.2: Public debt (from Wikipedia)

country	debt	percGDP
World	$56,308 billion	64%
United States*	$17,607 billion	73.60%
Japan	$9,872 billion	214.30%
China	$3,894 billion	31.70%
Germany	$2,592 billion	81.70%
Italy	$2,334 billion	126.10%
… and so on for 28 rows altogether.

To use these numbers for computations, they must be cleaned up.

Debt %>% 
  mutate(debt = gsub(pattern = "[$,%]|billion", replacement = "", debt),
         percGDP = gsub(pattern = "[,%]", replacement = "", percGDP))

Table 17.3: Remove comma separators, etc. to create a pure number.

country	debt	percGDP
World	56308	64
United States*	17607	73.60
Japan	9872	214.30
China	3894	31.70
Germany	2592	81.70
Italy	2334	126.10
… and so on for 28 rows altogether.
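After gsub(), the cleaned values are still character strings, so one more step is needed before doing arithmetic. The sketch below assumes two rows copied by hand from Table 17.2 rather than the full Debt table, and uses base R's trimws() and as.numeric() for the conversion; the variable names are chosen only for this illustration.

```r
# Two rows copied by hand from Table 17.2
debt    <- c("$56,308 billion", "$17,607 billion")
percGDP <- c("64%", "73.60%")

# Same patterns as above; inside [ ], "$" is an ordinary character
debt_clean    <- gsub(pattern = "[$,%]|billion", replacement = "", debt)
percGDP_clean <- gsub(pattern = "[,%]", replacement = "", percGDP)

# The space before "billion" is left behind, so trim it, then convert
debt_num    <- as.numeric(trimws(debt_clean))
percGDP_num <- as.numeric(percGDP_clean)

print(debt_clean)   # "56308 " "17607 "  (note the trailing space)
print(debt_num)     # 56308 17607
print(percGDP_num)  # 64.0 73.6
```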

Remove a currency sign

currency <- c("$100.95", "45¢")
gsub(pattern = "^\\$|€|¥|£|¢$", replacement = "", currency)

## [1] "100.95" "45"

Remove leading or trailing spaces

string <- "   My name is Julia     "
gsub(pattern = "^ +| +$", replacement = "", string)

## [1] "My name is Julia"
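The pattern above removes only space characters. If the padding might include tabs or line breaks, one possible variation is the POSIX class [[:space:]], and base R also offers trimws() for this task. The sketch below compares the three approaches; the second input string is made up for illustration.

```r
string <- "   My name is Julia     "

# The regex from the text: runs of spaces at the start or at the end
by_regex <- gsub(pattern = "^ +| +$", replacement = "", string)

# Base R helper that trims leading and trailing whitespace
by_trimws <- trimws(string)

# [[:space:]] also covers tabs and newlines (hypothetical input)
messy    <- "\tJulia \n"
by_class <- gsub(pattern = "^[[:space:]]+|[[:space:]]+$",
                 replacement = "", messy)

print(by_regex)   # "My name is Julia"
print(by_trimws)  # "My name is Julia"
print(by_class)   # "Julia"
```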

How often do names end in vowels?

NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>% 
  group_by( sex ) %>% 
  summarise( total = sum(total) )

Table 17.4: The number of babies given names ending in a vowel.

sex	total
F	96702371
M	21054791

Among babies whose names end in a vowel, girls outnumber boys by more than four to one (about 4.6 to 1).

17.2.1 Example: What are the most common ending vowels for names?

To answer this question, you have to extract the last vowel from the name. The regex ".*([aeiou])$" means “any characters, followed by one of aeiou immediately before the end of the string.” The parentheses in the regex mark the part of the match that gsub() should keep: the replacement "\\1" stands for whatever matched the contents of the first pair of parentheses.

Table 17.5 shows that names ending in “a”, “e”, or “i” are more common among girls. Names ending in “o” are more frequently boys.

regex <- ".*([aeiou])$"
NameList %>% 
  filter(grepl(pattern = regex, name)) %>% 
  mutate(vowel = gsub(pattern = regex, replacement = "\\1", name)) %>%
  group_by( sex, vowel ) %>% 
  summarise( total = sum(total) ) %>%
  arrange( desc(total) ) %>%
  pivot_wider(names_from = sex, values_from = total)

Table 17.5: The count of babies with names ending in each vowel.

vowel	F	M
a	56088501	1844041
e	36432218	14341114
o	403120	4041190
i	3693024	753311
u	85508	75135
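To see what the "\\1" replacement does, and why the filter() step comes before the mutate() step, it can help to try the regex on a handful of names. This is a small sketch with made-up names, not rows from NameList.

```r
# Hypothetical names; "Liam" does not end in a vowel
sample_names <- c("Olivia", "George", "Hugo", "Liam")
regex <- ".*([aeiou])$"

# Which names end in a lower-case vowel?
ends_in_vowel <- grepl(pattern = regex, sample_names)

# "\\1" keeps only what the parentheses matched: the final vowel
last_vowel <- gsub(pattern = regex, replacement = "\\1",
                   sample_names[ends_in_vowel])

# Without filtering first, a non-matching name passes through unchanged
unfiltered <- gsub(pattern = regex, replacement = "\\1", "Liam")

print(ends_in_vowel)  # TRUE TRUE TRUE FALSE
print(last_vowel)     # "a" "e" "o"
print(unfiltered)     # "Liam"
```

If the non-matching names were not filtered out, whole names such as "Liam" would end up in the vowel column.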

17.3 Exercises

Problem 17.1: Using the BabyNames data table, find the 10 most popular names according to each of these criteria.

Boys’ names ending in a vowel.
Names with a space (like Lee Ann)
Names ending with “joe” (like BettyJoe)

Problem 17.2: Construct a regular expression that matches names that sound like “Caitlin” — A k or c followed by one or more vowels followed by tl then one or more vowels, and ending with n.

Problem 17.3: Here is a character string with a regular expression:

"([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

To explain the first bit …

[2-9] means “one digit from 2 to 9.”
[0-9] refers to one digit from 0 to 9.
[0-9]{2} refers to two consecutive digits, 0 to 9.
[2-9][0-9]{2} means one digit 2 to 9 followed by two digits 0 to 9
[- .] means “any of the characters dash, space, period, just once.”
The parentheses refer to the matching contents to be extracted. The whole expression has the structure (stuff)[- .](more stuff)[- .](still more stuff). The three sets of parentheses mean to extract those three pieces from strings that match.

Explain what familiar kinds of strings the entire regular expression would match. (Hint: Call me maybe.) What components of those strings are being extracted?

Problem 17.4: Consider this regular expression (ignore line breaks):

"(A[LKSZRAP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|
K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|
RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])"

How long will the strings be that match the pattern?
How many different strings will match?
People living in the United States may be able to figure out what the pattern is meant to express. Give it a try.

Problem 17.5: A list of names from the Bible can be accessed like this:

BibleNames <- 
  readr::read_csv("https://mdbeckman.github.io/dcSupplement/data/BibleNames.csv")

Using the names in BibleNames,

Which names have any of these words in them: “bar”, “dam”, “lory”?
Which names end with those words?

You need only show a few.

Problem 17.6: A list of names from the Bible can be accessed like this:

BibleNames <- 
  readr::read_csv("https://mdbeckman.github.io/dcSupplement/data/BibleNames.csv")

Using BibleNames, make an informative plot showing, year by year, the proportion of all baby names that are Bible-related.

Problem 17.7: A list of names from the Bible can be accessed like this:

BibleNames <- 
  readr::read_csv("https://mdbeckman.github.io/dcSupplement/data/BibleNames.csv")

Using BabyNames, make a data table showing the total number of babies, adding up over all the years, given each Bible-related name. Display the ten most popular names.