Chapter 16 Using Regular Expressions

A regular expression, often called regex, is a way of describing patterns in strings of characters. What is such a pattern? A few examples:

  • Telephone numbers. Although called “numbers,” they are often written in a nonstandard numerical format, e.g.,
    • (651) 696-6000, +1 651.696.6000
  • Times of day are written according to a pattern, e.g.
    10:30 AM, 9 o’clock, 13:15.54
    Durations are written in a similar way. A world-class time in the marathon is 2:02:57, just over two hours.
  • Postal codes. In the US, postal codes are written in several formats:
    • five digits, e.g., 55105,
    • Zip + 4, 9 digits written 55105-1362
  • Leading and trailing spaces. Sometimes the content of a string is preceded or followed by blank spaces, like this:
    " henry "

Regular expressions are used for several purposes:

  • to detect whether a pattern is contained in a string. Use filter() and grepl().
  • to replace the elements of that pattern with something else. Use mutate() and gsub().
  • to extract a component that matches the pattern. Use extractMatches() from the DataComputing package.
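
The three purposes can also be tried out on a small character vector, without any data table. The following is a minimal sketch in base R; the `phones` vector is invented for illustration, and `regmatches()` with `regexpr()` is a base R stand-in for extraction, not the DataComputing function named above.

```r
# Hypothetical sample strings, made up for this illustration
phones <- c("(651) 696-6000", "+1 651.696.6000", "no phone listed")

# Detect: is there a run of three digits, a separator, then four digits?
has_number <- grepl("[0-9]{3}[-.][0-9]{4}", phones)

# Replace: delete every character that is not a digit
digits_only <- gsub("[^0-9]", "", phones)

# Extract: pull out the first run of exactly three digits in each string.
# regmatches() silently drops strings that have no match at all.
area_code <- regmatches(phones, regexpr("[0-9]{3}", phones))

has_number   # TRUE TRUE FALSE
digits_only  # "6516966000" "16516966000" ""
area_code    # "651" "651"
```

Note that the extraction result has only two elements, because the third string contains no digits.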

To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.

NameList <- BabyNames %>% 
  group_by( name, sex ) %>%
  summarise( total=sum(count) ) %>%
  arrange( desc(total)) 

Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.

  • The name contains “shine”, as in “sunshine” or “moonshine” or “Shinelle”

    NameList %>% 
      filter(grepl("shine", name, ignore.case = TRUE)) 
    name sex total
    Sunshine F 4959
    Shineka F 150
    … and so on for 16 rows altogether.
  • The name contains three or more vowels in a row.

    NameList %>% 
      filter(grepl("[aeiou]{3,}", name, ignore.case = TRUE)) 
    name sex total
    Louis M 389910
    Louise F 331551
    … and so on for 1,933 rows altogether.
  • The name contains three or more consonants in a row.

    NameList %>% 
      filter(grepl("[^aeiou]{3,}", name, ignore.case = TRUE) ) 
    name sex total
    Christopher M 1984307
    Matthew M 1540182
    … and so on for 16,767 rows altogether.
  • The name contains “mn”

    NameList %>% 
      filter(grepl("mn", name, ignore.case = TRUE)) 
    name sex total
    Autumn F 104408
    Sumner M 2287
    … and so on for 51 rows altogether.
  • The first, third, and fifth letters are consonants.

    NameList %>% 
      filter(grepl("^[^aeiou].[^aeiou].[^aeiou]", name,
               ignore.case = TRUE)) 
    name sex total
    James M 5091189
    Robert M 4789776
    … and so on for 28,747 rows altogether.

16.1 Regex basics

There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.

Regexes encode patterns. The simplest patterns can be taken literally. For instance, "able" will match any string with “able” in it, e.g., “You are able …”, “She is capable”, “The motion is tabled”. But "able" will not match “Able to do what?” Capital letters are distinct from lower case, although this can be turned off in many functions with the argument ignore.case = TRUE, as used in the previous examples based on BabyNames.
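
A quick way to see the effect of capitalization is to run the same literal pattern with and without `ignore.case = TRUE`. This sketch uses the phrases from the paragraph above; the variable names are chosen here for illustration.

```r
phrases <- c("You are able", "She is capable",
             "The motion is tabled", "Able to do what?")

# Case-sensitive by default: "Able" with a capital A is not matched
case_sensitive <- grepl("able", phrases)

# With ignore.case = TRUE, upper and lower case are treated alike
case_insensitive <- grepl("able", phrases, ignore.case = TRUE)

case_sensitive    # TRUE TRUE TRUE FALSE
case_insensitive  # TRUE TRUE TRUE TRUE
```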

More complicated patterns involve punctuation characters that have a special meaning, including ., ^, $, [, ], |, {, }, (, ), *, +, and ?.

For example, ".ble" is a regex that matches any sequence of four characters ending with “ble”. This will match all the strings matched by "able" but others as well: “sugar is soluble in water”, “money is fungible”, “Bible reading for today”.
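
As a small check of the difference between the two patterns, the sketch below applies both to the example sentences. The last string, "blessing", is an extra case added here to show that the . must match some character before “ble”.

```r
examples <- c("sugar is soluble in water", "money is fungible",
              "Bible reading for today", "She is capable", "blessing")

# "." stands for any single character, so ".ble" needs one character before "ble"
dot_ble  <- grepl(".ble", examples)

# The literal pattern only matches when the preceding character is "a"
able_lit <- grepl("able", examples)

dot_ble   # TRUE TRUE TRUE TRUE FALSE
able_lit  # FALSE FALSE FALSE TRUE FALSE
```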

  • Simple patterns:
    • A single . means “any character.”
    • A character, e.g., b, means just that character.
    • Characters enclosed in square brackets, e.g., [aeiou] means any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
    • The ^ inside square brackets means “any except these.” So, a consonant is [^aeiou]
  • Alternatives. A vertical bar means “either.” For instance, the regex "rain|snow|sleet|hail" will match any of those four forms of precipitation.
  • Two simple patterns in a row mean those patterns consecutively. Example: "M[aeiou]" means a capital M followed by a single lower-case vowel.
  • Repeats
    • A simple pattern followed by a * means “zero or more times.” So, "M[aeiou]*" matches “Miocene”, “Mr.”, “Ms”, “Mrs.”, “Miss”, and “Man”, among many others.
    • A simple pattern followed by a ? means “zero or one times.” So, "M[aeiou]?[^aeiou]" will not match “Miocene”. (The regex means: M followed by at most one vowel, followed by a consonant. “Miocene” has two vowels following the M.)
    • A simple pattern followed by a + means “one or more times.” "M[aeiou]+" will match “Miocene” and “Miss”, but not “Mr.”
    • A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row” and will not match “Miss” or “Mr” or “Madam”.
  • Start and end of strings.
    • ^ at the beginning of a regular expression means “the start of the string”
    • $ at the end means “the end of the string.”
  • Extraction: Some functions that process regexes will extract the specific character sequence that matches the specified pattern. To specify which part of the match is to be extracted, put it inside open and close parentheses. For example, the regex "([aeiouAEIOU]{4})" in the following statement will match all names with 4 consecutive vowels of either upper or lower case and extract those vowels as a new variable.
BabyNames %>%
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  extractMatches("([aeiouAEIOU]{4})", name) %>%
  filter( ! is.na(match1)) %>%
  arrange(desc(total)) 

Table 16.1: Baby names with 4 consecutive vowels captured by the regex "([aeiouAEIOU]{4})"

name total match1
Louie 29293 ouie
Sequoia 3275 uoia
Gioia 480 ioia
Keiaira 81 eiai
Zoiee 77 oiee
Daaiel 76 aaie
… and so on for 39 rows altogether.
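
Returning to the repeat and anchor rules listed earlier in this section, each one can be checked directly with grepl(). The sketch below is only illustrative: the `words` vector reuses examples from the text, and the standard meanings are assumed (* is zero or more, + is one or more, ? is zero or one).

```r
words <- c("Miocene", "Mr.", "Miss", "Man", "Madam")

# * : zero or more vowels after M, so a bare "M" is enough
zero_or_more <- grepl("M[aeiou]*", words)

# + : at least one vowel must follow the M
one_or_more <- grepl("M[aeiou]+", words)

# ? : at most one vowel, and then a non-vowel must follow
optional <- grepl("M[aeiou]?[^aeiou]", words)

# {2} : two vowels in a row somewhere in the string
two_vowels <- grepl("[aeiou]{2}", words)

# $ : the string must end with a lower-case m
ends_in_m <- grepl("m$", words)

zero_or_more  # TRUE TRUE TRUE TRUE TRUE
one_or_more   # TRUE FALSE TRUE TRUE TRUE
optional      # FALSE TRUE TRUE TRUE TRUE
two_vowels    # TRUE FALSE FALSE FALSE FALSE
ends_in_m     # FALSE FALSE FALSE FALSE TRUE
```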

On occasion, you will want to use one of the regex control characters in a literal sense, for example . to mean “period” and not “any one character.” You can do this by “escaping” that character, that is, by preceding it with a double backslash, \\. For instance, in the regex "^\\$|¥|£|¢$" about currency symbols, the first $ is intended to refer to the dollar currency sign. But on its own in a regex, $ refers to the end of a character string. In order to signal that the first $ is to be taken literally as a currency symbol, it’s been escaped with \\. To create a regex involving the period found at the end of sentences, you need to escape the ., like this: "\\.$"
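
To see what escaping changes, compare an escaped and an unescaped period on a few strings. This is a minimal sketch with made-up sentences; the `fixed = TRUE` argument shown at the end is a base R alternative that treats the whole pattern as literal text, which is not discussed in this chapter.

```r
sentences <- c("It rained.", "Did it rain?", "Version 2.0 shipped")

# Escaped: a literal period at the end of the string
ends_with_period <- grepl("\\.$", sentences)

# Unescaped: any character at the end of the string, so every string matches
ends_with_anything <- grepl(".$", sentences)

# Escaped dollar sign: remove the currency symbol, not the end of the string
no_dollar <- gsub("\\$", "", "$100.95")

# Alternative: fixed = TRUE turns off regex interpretation entirely
no_dollar_fixed <- gsub("$", "", "$100.95", fixed = TRUE)

ends_with_period    # TRUE FALSE FALSE
ends_with_anything  # TRUE TRUE TRUE
no_dollar           # "100.95"
no_dollar_fixed     # "100.95"
```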

16.2 Example tasks with regular expressions

Get rid of percent signs and commas in numerals

Numbers often come with comma separators or unit symbols such as % or $. For instance, Table 16.2 shows part of a table about public debt from Wikipedia.

Table 16.2: Public debt (from Wikipedia)

country          debt              percGDP
World            $56,308 billion   64%
United States*   $17,607 billion   73.60%
Japan            $9,872 billion    214.30%
China            $3,894 billion    31.70%
Germany          $2,592 billion    81.70%
Italy            $2,334 billion    126.10%
… and so on for 28 rows altogether.

To use these numbers for computations, they must be cleaned up.

Debt %>% 
  # strip "$", ",", "%", and the word "billion" together with the space before it
  mutate( debt=gsub("[$,%]| *billion","",debt),
          percGDP=gsub("[,%]", "", percGDP)) 

Table 16.3: Remove comma separators, etc. to create a pure number.

country          debt    percGDP
World            56308   64
United States*   17607   73.60
Japan            9872    214.30
China            3894    31.70
Germany          2592    81.70
Italy            2334    126.10
… and so on for 28 rows altogether.
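
After gsub(), the cleaned values are still character strings. One way to finish the job, sketched here in base R, is to wrap each gsub() call in as.numeric(). The small data frame `Debt_sample` is a hypothetical stand-in that copies three rows from Table 16.2; it is not the full Debt table.

```r
# Hypothetical three-row stand-in for the Debt table
Debt_sample <- data.frame(
  country = c("World", "United States*", "Japan"),
  debt    = c("$56,308 billion", "$17,607 billion", "$9,872 billion"),
  percGDP = c("64%", "73.60%", "214.30%"),
  stringsAsFactors = FALSE
)

# Strip the symbols and the word "billion", then convert text to numbers
Debt_sample$debt    <- as.numeric(gsub("[$,]| *billion", "", Debt_sample$debt))
Debt_sample$percGDP <- as.numeric(gsub("%", "", Debt_sample$percGDP))

Debt_sample$debt     # 56308 17607 9872
Debt_sample$percGDP  # 64.0 73.6 214.3
```

Once the columns are numeric, they can be used in arithmetic, sorting, and plotting.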

Remove a currency sign

gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))
## [1] "100.95" "45"

Remove leading or trailing spaces

gsub("^ +| +$", "", "   My name is Julia     ")
## [1] "My name is Julia"
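
The pattern above removes only space characters. If the padding might also contain tabs or newlines, a character class such as [[:space:]] can be used instead, and base R also provides trimws() for the same job. The strings below are made up for illustration.

```r
padded <- "   My name is Julia     "
messy  <- "\t My name is Julia \n"

# Same idea as above, but matching any whitespace, not just spaces
trimmed_class <- gsub("^[[:space:]]+|[[:space:]]+$", "", messy)

# trimws() is a base R helper that strips leading and trailing whitespace
trimmed_builtin <- trimws(padded)

trimmed_class    # "My name is Julia"
trimmed_builtin  # "My name is Julia"
```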

How often do names end in vowels?

NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>% 
  group_by( sex ) %>% 
  summarise( total=sum(total) )

Table 16.4: The number of babies given names ending in a vowel.

sex total
F 96702371
M 21054791

In these data, about 4.6 times as many girls as boys were given names ending in a vowel.

16.2.1 Example: What are the most common ending vowels for names?

To answer this question, you have to extract the final letter from each name that ends in a vowel. The regex ".*([aeiou])$" means “any characters followed by one of aeiou immediately before the end of the string.” The parentheses mark the part of the match to pull out: in gsub(), the replacement "\\1" stands for whatever matched the contents of the first pair of parentheses, so the whole name is replaced by its final vowel.

regex <- ".*([aeiou])$"
NameList %>% 
  filter(grepl(regex, name)) %>% 
  mutate(vowel = gsub(regex, "\\1", name)) %>%
  group_by( sex, vowel ) %>% 
  summarise( total=sum(total) ) %>%
  arrange( desc(total) ) %>%
  spread(key=sex, value=total)

Table 16.5: The count of babies with names ending in each vowel.

vowel F M
a 56088501 1844041
e 36432218 14341114
i 3693024 753311
o 403120 4041190
u 85508 75135

Table 16.5 shows that a name ending in “a”, “e”, or “i” suggests a girl. Names ending in “o” are mostly boys’ names, while names ending in “u” are split almost evenly.
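
The same extraction step can be tried on a handful of names without the full data table. This is a minimal base R sketch; the `names_sample` vector is invented for illustration and its counts say nothing about the real BabyNames data.

```r
# Hypothetical sample of names
names_sample <- c("Emma", "Olivia", "Noah", "Leo", "Sophie", "Eli")

regex <- ".*([aeiou])$"

# Keep only names ending in a vowel, as filter(grepl(...)) does above
ends_in_vowel <- grepl(regex, names_sample)

# Replace each whole name by the contents of the parentheses: its final vowel
final_vowel <- gsub(regex, "\\1", names_sample[ends_in_vowel])

# Count how often each vowel occurs
vowel_counts <- table(final_vowel)

final_vowel   # "a" "a" "o" "e" "i"
vowel_counts
```

“Noah” is dropped by the grepl() step because it ends in a consonant. Without that step, gsub() would leave a non-matching name unchanged instead of returning a vowel.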

16.3 Exercises
