“Regular expressions” are a notation for describing patterns in strings of characters. What is such a “pattern?” A few examples:

Telephone numbers. Although called “numbers,” they are often written in a nonstandard numerical format, e.g.,
- (651) 696-6000
- +1 651.696.6000
Times of day. These are written according to a pattern, e.g., 10:30 AM, 9 o’clock, 13:15.54. Durations are written in a similar way; a world-class time in the marathon is 2:02:57.
Postal codes. In the US, postal codes are written in several formats:
- five digits, e.g., 55105,
- ZIP+4, nine digits written as 55105-1362
Leading and trailing spaces. Sometimes the content of a string is preceded or followed by blank spaces, like this: " henry ".
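
As a rough sketch of what such patterns can look like in practice, the block below checks the two US postal code formats and detects stray spaces. The regular expressions and the test strings are illustrative assumptions rather than part of the original notes, and real data usually needs more cases than these.

```r
# Illustrative patterns only (assumptions, not from the original notes)

# Either exactly five digits, or five digits, a hyphen, and four more digits
zip_pattern <- "^[0-9]{5}$|^[0-9]{5}-[0-9]{4}$"
codes <- c("55105", "55105-1362", "5510", "55105-13")
zip_ok <- grepl(zip_pattern, codes)
zip_ok

# A space at the very start or at the very end of the string
padded <- grepl("^ | $", c(" henry ", "henry"))
padded
```

The first two codes match and the last two do not; only the padded " henry " is flagged by the second pattern.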

Regular expressions are used for several purposes:
- to detect whether a pattern is contained in a string. Use filter() and grepl().
- to replace the elements of that pattern with something else. Use mutate() and gsub().
- to extract a component that matches the pattern. Use extract() from the DCF package.
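
To make the three purposes concrete, here is a minimal base R sketch on a small vector of strings. The vector is hypothetical, and regmatches() with regexpr() is used as a base R stand-in for extract(), which is an assumption on my part rather than the approach these notes rely on.

```r
x <- c("sunshine", "louis", "autumn")   # hypothetical example strings

# Detect: does each string contain "shine"?
detected <- grepl("shine", x)

# Replace: delete every lower-case vowel
replaced <- gsub("[aeiou]", "", x)

# Extract: pull out the first run of vowels in each string
# (base R stand-in for extract())
first_vowels <- regmatches(x, regexpr("[aeiou]+", x))

detected
replaced
first_vowels
```

Inside a data verb pipeline, the same grepl() and gsub() calls would sit inside filter() and mutate() respectively.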

To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.

NameList <- BabyNames %>% 
  mutate( name=tolower(name) ) %>%
  group_by( name, sex ) %>%
  summarise( total=sum(count) ) %>%
  arrange( desc(total))

Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that compares a regular expression to a string, returning TRUE if there’s a match, FALSE otherwise.

The name contains “shine”, as in “sunshine” or “moonshine”

NameList %>% 
  filter( grepl( "shine", name ) ) %>% 
  head()

Source: local data frame [6 x 3]
Groups: name

      name sex total
1 sunshine   F  4959
2  shineka   F   150
3    shine   F    95
4    shine   M    44
5 sunshine   M    37
6 shinequa   F    27

The name contains three or more vowels in a row.

NameList %>% 
  filter( grepl( "[aeiou]{3,}", name ) ) %>% 
  head()

Source: local data frame [6 x 3]
Groups: name

      name sex  total
1    louis   M 389910
2   louise   F 331551
3   isaiah   M 177412
4    louie   M  27121
5     beau   M  26693
6 precious   F  18268

The name contains three or more consonants in a row.

NameList %>% 
  filter( grepl( "[^aeiou]{3,}", name ) ) %>% 
  head()

Source: local data frame [6 x 3]
Groups: name

         name sex   total
1 christopher   M 1984307
2     matthew   M 1540182
3     anthony   M 1391462
4      andrew   M 1244667
5     dorothy   F 1105281
6     timothy   M 1057538

The name contains “mn”

NameList %>% 
  filter( grepl( "mn", name ) ) %>% 
  head()

Source: local data frame [6 x 3]
Groups: name

     name sex  total
1  autumn   F 104408
2  sumner   M   2287
3    amna   F   1099
4 domnick   M    405
5  tatumn   F    280
6  autumn   M    258

The first, third, and fifth letters are consonants.

NameList %>% 
  filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>% 
  head()

Source: local data frame [6 x 3]
Groups: name

         name sex   total
1       james   M 5091189
2      robert   M 4789776
3       david   M 3565229
4      joseph   M 2557792
5 christopher   M 1984307
6     matthew   M 1540182
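
The last pattern is easier to read when tried on a few names one piece at a time. In this sketch the test names are chosen for illustration: ^ pins the match to the start of the name, each [^aeiou] requires a non-vowel, and each . accepts any character in the second and fourth positions.

```r
pattern <- "^[^aeiou].[^aeiou].[^aeiou]"

# Hypothetical test names
test_names <- c("james", "robert", "louis", "amy")
matches <- grepl(pattern, test_names)
matches
```

"james" and "robert" match. "louis" fails because its third letter is a vowel, and "amy" fails at the first letter (a name shorter than five letters could not match in any case).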

Examples of accomplishing tasks with regular expressions.

Get rid of percent signs and commas in numerals

Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.

head(Debt,3)

         country            debt percentGDP perCapita percentWorldPublicDebt
1          World $56,308 billion        64%     7,936                100.00%
2 United States* $17,607 billion     73.60%    36,653                 31.27%
3          Japan  $9,872 billion    214.30%    77,577                 17.53%

To use these numbers for computations, they must be cleaned up.

Debt %>% 
  mutate( debt=gsub("[$,%]|billion","",debt),
          percentGDP=gsub("[,%]", "", percentGDP)) %>%
  head(3)

         country   debt percentGDP perCapita percentWorldPublicDebt
1          World 56308          64     7,936                100.00%
2 United States* 17607       73.60    36,653                 31.27%
3          Japan  9872      214.30    77,577                 17.53%
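
Note that gsub() returns character strings, so one more step is needed before doing arithmetic. The following sketch uses values copied from the table above as a stand-alone vector (the Debt data frame itself is not assumed to be available) and converts the cleaned strings with as.numeric().

```r
# Values copied from the table above
debt_raw <- c("$56,308 billion", "$17,607 billion", "$9,872 billion")
pct_raw  <- c("64%", "73.60%", "214.30%")

# Same patterns as in the mutate() call, followed by conversion to numbers.
# as.numeric() tolerates the space left behind after removing "billion".
debt_num <- as.numeric(gsub("[$,%]|billion", "", debt_raw))
pct_num  <- as.numeric(gsub("[,%]", "", pct_raw))

debt_num
pct_num
```

In the pipeline, the as.numeric() call could wrap each gsub() inside mutate().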

Remove a currency sign

gsub("^\\$|€|¥|£|¢$","", c("$100.95", "45¢"))

[1] "100.95" "45"

Remove leading or trailing spaces

gsub( "^ +| +$", "", "   My name is Julia     ")

[1] "My name is Julia"
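
For this particular task base R also offers trimws(), which is not mentioned in the original notes. A quick check, as a sketch, that it agrees with the regular expression on this example:

```r
s <- "   My name is Julia     "

trimmed_regex   <- gsub("^ +| +$", "", s)  # the regular expression from above
trimmed_builtin <- trimws(s)               # base R helper

same_result <- identical(trimmed_regex, trimmed_builtin)
same_result
```

By default trimws() also strips tabs and newlines, which the pattern "^ +| +$" does not.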

How often do boys’ names end in vowels?

NameList %>%
  filter( grepl( "[aeiou]$", name ) ) %>% 
  group_by( sex ) %>% 
  summarise( total=sum(total) )

Source: local data frame [2 x 2]

  sex    total
1   F 96702371
2   M 21054791

Girls’ names are almost five times as likely to end in vowels as boys’ names.
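
The comparison can be checked with a line of arithmetic on the two totals in the output above. This compares raw counts; reading it as "likelihood" assumes that the data contain roughly equal numbers of girls and boys.

```r
# Totals copied from the output above
girls_total <- 96702371
boys_total  <- 21054791

ratio <- girls_total / boys_total
round(ratio, 1)
```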

What are the most common end vowels for names?

To answer this question, you have to extract the final letter of each name when that letter is a vowel. The extract() transformation function can do this.

You’ll have to bring in the extract() function; it’s not yet a part of the DCF package.

source( url( "http://tinyurl.com/m4o4n2b/DCF/extract.R" ))

NameList %>% 
  extract(data=., "([aeiou])$", name, vowel=1 ) %>%
  group_by( sex, vowel ) %>% 
  summarise( total=sum(total) ) %>%
  arrange( sex, desc(total) )

Source: local data frame [12 x 3]
Groups: sex

   sex vowel     total
1    F    NA  68578358
2    F     a  56088501
3    F     e  36432218
4    F     i   3693024
5    F     o    403120
6    F     u     85508
7    M    NA 147082250
8    M     e  14341114
9    M     o   4041190
10   M     a   1844041
11   M     i    753311
12   M     u     75135
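
If extract() is not available, a similar result can be sketched in base R. This is an assumed alternative, not the implementation behind extract(): grepl() marks the names that end in a vowel, sub() keeps only the parenthesized part of the match, and the remaining names get NA, as in the NA rows of the table above. The names used are hypothetical.

```r
names_vec <- c("louise", "beau", "james", "matthew", "anna")  # hypothetical

ends_in_vowel <- grepl("[aeiou]$", names_vec)

# "\\1" refers to whatever the parentheses in the pattern matched
vowel <- ifelse(ends_in_vowel,
                sub("^.*([aeiou])$", "\\1", names_vec),
                NA_character_)
vowel
```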

Reading Regular Expressions

There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.

There are many regular expression tutorials on the Internet, for instance this interactive one. The basic structure is outlined here.

Some basics:

Very simple patterns:
- A single . means “any character.”
- A character, e.g., b, means just that character.
- Characters enclosed in square brackets, e.g., [aeiou], mean any one of those characters. (So, [aeiou] is a pattern describing a vowel.)
- A ^ immediately after the opening square bracket means “any character except these.” So, a consonant is [^aeiou].
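
These simple patterns can be tried out directly with grepl(). The words below are made up for illustration.

```r
words <- c("bat", "bit", "b2t", "cat")   # hypothetical test strings

any_char  <- grepl("b.t", words)         # b, then any one character, then t
vowel_mid <- grepl("b[aeiou]t", words)   # b, then a vowel, then t
non_vowel <- grepl("b[^aeiou]t", words)  # b, then anything but a vowel, then t

any_char
vowel_mid
non_vowel
```

Note that [^aeiou] accepts the digit in "b2t": it matches any character that is not a lower-case vowel, not only consonants.
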
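Alternatives. A vertical bar means “either”: the pattern on its left or the pattern on its right. For example, the regular expression "^ +| +$" used earlier matches either leading spaces or trailing spaces.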
Repeats
- Two simple patterns in a row mean those patterns consecutively. Example: M[aeiou] means a capital M followed by a lower-case vowel.
- A simple pattern followed by a + means “one or more times.”
- A simple pattern followed by a ? means “zero or one times.”
- A simple pattern followed by a * means “zero or more times.”
- A simple pattern followed by {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more. For instance, [aeiou]{2} means “exactly two vowels in a row.”
Start and end of strings.
- ^ at the beginning of a regular expression means “the start of the string.”
- $ at the end means “the end of the string.”
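
A small experiment helps keep the repeat operators and anchors straight. In this sketch the test strings are made up, and each pattern is anchored with ^ and $ so that the whole string has to fit the pattern. The comments give the standard meaning of each operator in R's regular expressions.

```r
x <- c("b", "ba", "baa", "baaa", "abba")   # hypothetical test strings

one_or_more  <- grepl("^ba+$", x)      # + : one or more a's
zero_or_more <- grepl("^ba*$", x)      # * : zero or more a's
zero_or_one  <- grepl("^ba?$", x)      # ? : zero or one a
exactly_two  <- grepl("^ba{2}$", x)    # {2} : exactly two a's
either       <- grepl("^b$|^abba$", x) # | : either whole-string alternative

one_or_more
zero_or_more
zero_or_one
exactly_two
either
```

Without the anchors, grepl() reports a match whenever the pattern occurs anywhere in the string, so "a{2}" would also match "baaa".
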
Extraction