“Regular expressions” are a notation for describing patterns in strings of characters. What is such a “pattern”? A few examples:
" henry "
Regular expressions are used for several purposes:
* to detect whether a pattern is contained in a string. Use filter() and grepl().
* to replace the elements of that pattern with something else. Use mutate() and gsub().
* to extract a component that matches the pattern. Use extract() from the DCF package.
To illustrate, consider the baby names data, summarised to give the total count of each name for each sex.
NameList <- BabyNames %>%
mutate( name=tolower(name) ) %>%
group_by( name, sex ) %>%
summarise( total=sum(count) ) %>%
arrange( desc(total))
Here are some examples of patterns in names and the use of a regular expression to detect them. The regular expression is the string in quotes. grepl() is a function that searches a string for a regular expression, returning TRUE if the pattern matches anywhere in the string and FALSE otherwise.
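Before applying grepl() to the full data, it may help to see it on a plain character vector. The following is a minimal sketch in base R; the vector sample_names is made up for illustration and is not taken from the baby names data.

```r
# Hypothetical sample names, not drawn from BabyNames
sample_names <- c("sunshine", "henry", "moonshine", "shineka", "louise")

# grepl() returns one logical value per element of the input vector
has_shine <- grepl("shine", sample_names)
print(has_shine)

# Logical indexing keeps only the matching elements, which is
# what filter() does row by row in a data frame
shine_names <- sample_names[has_shine]
print(shine_names)
```

The pattern may match anywhere in the string, so "sunshine" and "shineka" both count as matches.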
The name contains “shine”, as in “sunshine” or “moonshine”
NameList %>%
filter( grepl( "shine", name ) ) %>%
head()
Source: local data frame [6 x 3]
Groups: name
name sex total
1 sunshine F 4959
2 shineka F 150
3 shine F 95
4 shine M 44
5 sunshine M 37
6 shinequa F 27
The name contains three or more vowels in a row.
NameList %>%
filter( grepl( "[aeiou]{3,}", name ) ) %>%
head()
Source: local data frame [6 x 3]
Groups: name
name sex total
1 louis M 389910
2 louise F 331551
3 isaiah M 177412
4 louie M 27121
5 beau M 26693
6 precious F 18268
The name contains three or more consonants in a row. (Here, any letter that is not a, e, i, o, or u counts as a consonant, including y.)
NameList %>%
filter( grepl( "[^aeiou]{3,}", name ) ) %>%
head()
Source: local data frame [6 x 3]
Groups: name
name sex total
1 christopher M 1984307
2 matthew M 1540182
3 anthony M 1391462
4 andrew M 1244667
5 dorothy F 1105281
6 timothy M 1057538
The name contains “mn”.
NameList %>%
filter( grepl( "mn", name ) ) %>%
head()
Source: local data frame [6 x 3]
Groups: name
name sex total
1 autumn F 104408
2 sumner M 2287
3 amna F 1099
4 domnick M 405
5 tatumn F 280
6 autumn M 258
The first, third, and fifth letters of the name are consonants.
NameList %>%
filter( grepl( "^[^aeiou].[^aeiou].[^aeiou]", name ) ) %>%
head()
Source: local data frame [6 x 3]
Groups: name
name sex total
1 james M 5091189
2 robert M 4789776
3 david M 3565229
4 joseph M 2557792
5 christopher M 1984307
6 matthew M 1540182
Numbers often come with comma separators or unit symbols such as % or $. For instance, here is part of a table about public debt from Wikipedia.
head(Debt,3)
country debt percentGDP perCapita percentWorldPublicDebt
1 World $56,308 billion 64% 7,936 100.00%
2 United States* $17,607 billion 73.60% 36,653 31.27%
3 Japan $9,872 billion 214.30% 77,577 17.53%
To use these numbers for computations, they must be cleaned up.
Debt %>%
mutate( debt=gsub("[$,%]|billion","",debt),
percentGDP=gsub("[,%]", "", percentGDP)) %>%
head(3)
country debt percentGDP perCapita percentWorldPublicDebt
1 World 56308 64 7,936 100.00%
2 United States* 17607 73.60 36,653 31.27%
3 Japan 9872 214.30 77,577 17.53%
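After gsub(), the cleaned values are still character strings, so one more step is needed before doing arithmetic. The sketch below, in base R, copies three values from the table above into stand-alone vectors (it is not the full Debt data frame, and the variable names are illustrative) and converts them with as.numeric().

```r
# Three values copied from the table above; illustrative only
debt_text <- c("$56,308 billion", "$17,607 billion", "$9,872 billion")
pct_text  <- c("64%", "73.60%", "214.30%")

# Strip $, commas, %, spaces, and the word "billion"
debt_clean <- gsub("[$,% ]|billion", "", debt_text)
pct_clean  <- gsub("[,%]", "", pct_text)

# Convert the cleaned strings to numbers
debt_billions <- as.numeric(debt_clean)
pct_gdp       <- as.numeric(pct_clean)

print(debt_billions)
print(pct_gdp)
```

Inside square brackets, $ is an ordinary character, so it does not need to be escaped there.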
Remove a currency symbol from the start or end of a string.
gsub("^(\\$|€|¥|£)|¢$","", c("$100.95", "45¢"))
[1] "100.95" "45"
Remove leading or trailing spaces from a string.
gsub( "^ +| +$", "", " My name is Julia ")
[1] "My name is Julia"
How often do boys’ and girls’ names end in vowels?
NameList %>%
filter( grepl( "[aeiou]$", name ) ) %>%
group_by( sex ) %>%
summarise( total=sum(total) )
Source: local data frame [2 x 2]
sex total
1 F 96702371
2 M 21054791
Girls’ names are almost five times as likely to end in vowels as boys’ names.
Which vowels most commonly end names? To answer this question, you have to extract the final letter of each name when it is a vowel. The extract() transformation function can do this.
You’ll have to bring in the extract() function; it’s not yet a part of the DCF package.
source( url( "http://tinyurl.com/m4o4n2b/DCF/extract.R" ))
NameList %>%
extract(data=., "([aeiou])$", name, vowel=1 ) %>%
group_by( sex, vowel ) %>%
summarise( total=sum(total) ) %>%
arrange( sex, desc(total) )
Source: local data frame [12 x 3]
Groups: sex
sex vowel total
1 F NA 68578358
2 F a 56088501
3 F e 36432218
4 F i 3693024
5 F o 403120
6 F u 85508
7 M NA 147082250
8 M e 14341114
9 M o 4041190
10 M a 1844041
11 M i 753311
12 M u 75135
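If the extract() helper is not available, a similar result can be sketched with base R alone. This is not the DCF implementation, only an approximation of the same idea using regexpr() and regmatches(); the vector sample_names is made up for illustration.

```r
# Hypothetical sample names, not drawn from BabyNames
sample_names <- c("emma", "noah", "olivia", "leo", "henry")

# Which names end in a vowel?
ends_in_vowel <- grepl("[aeiou]$", sample_names)

# regexpr() locates the match; regmatches() returns the matched
# text, but only for the elements that actually matched
m <- regexpr("[aeiou]$", sample_names)

# Start with NA for every name, then fill in the matches
last_vowel <- rep(NA_character_, length(sample_names))
last_vowel[ends_in_vowel] <- regmatches(sample_names, m)

print(last_vowel)
```

Names that end in a consonant are left as NA, which corresponds to the NA rows in the table above.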
There are simple regular expressions and complicated ones. All of them look foreign until you learn how to read them.
There are many regular expression tutorials on the Internet, for instance this interactive one. The basic structure is outlined here.
Some basics:
* . means “any character.”
* A single character, such as b, means just that character.
* [aeiou] means any of those characters. (So, [aeiou] is a pattern describing a vowel.)
* ^ inside square brackets means “any except these.” So, a consonant is [^aeiou].
* M[aeiou] means a capital M followed by a lower-case vowel.
* + means “one or more times.”
* ? means “zero or one times.”
* * means “zero or more times.”
* {2} means “exactly two times.” Similarly, {2,5} means between two and five times, and {6,} means six times or more.
* [aeiou]{2} means “exactly two vowels in a row.”
* ^ at the beginning of a regular expression means “the start of the string.”
* $ at the end means “the end of the string.”
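The repetition symbols are easy to mix up, so a small experiment can help. The sketch below, in base R, tests several quantifiers against strings of zero to three a’s; the test strings are made up for illustration. Anchoring each pattern with ^ and $ forces it to describe the whole string.

```r
# Strings containing zero, one, two, and three a's
test_strings <- c("", "a", "aa", "aaa")

zero_or_more <- grepl("^a*$", test_strings)     # TRUE  TRUE  TRUE  TRUE
one_or_more  <- grepl("^a+$", test_strings)     # FALSE TRUE  TRUE  TRUE
zero_or_one  <- grepl("^a?$", test_strings)     # TRUE  TRUE  FALSE FALSE
exactly_two  <- grepl("^a{2}$", test_strings)   # FALSE FALSE TRUE  FALSE
two_or_more  <- grepl("^a{2,}$", test_strings)  # FALSE FALSE TRUE  TRUE

# Collect the results in one table for comparison
results <- data.frame(string = test_strings, zero_or_more, one_or_more,
                      zero_or_one, exactly_two, two_or_more)
print(results)
```

Without the anchors, a pattern such as a{2} would also match "aaa", because the two a’s only need to appear somewhere in the string.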