Project: Street or Road?

People’s addresses involve streets, lanes, courts, avenues, and so on. How many such road-related words are in common use?

address
PO BOX 90081
314 SOUTHERN CROSS LANE
2725 JOYNER ROAD
PO BOX 127
258 LYNNBANK ESTATES RD
PO BOX 27032
NCSU BOX 16010
PO BOX 40535
… and so on for 15,483 rows altogether.

In answering this question, you would presumably want to look at lots of addresses and extract the road-related term. You could do this by eye, reading down a list of a few hundred or thousand addresses. But if you want to do it on a really large scale, a city or state or country, you would want some automated help, for instance, a computer program that discards the sorts of entries you have already identified to give a greater concentration of unidentified terms. In this activity, you’re going to build such a program.

Some resources:

  1. The file http://tiny.cc/dcf/street-addresses.csv contains about 15000 street addresses of registered voters in Wake County, North Carolina.
  2. The file http://tiny.cc/dcf/CMS_ProvidersSimple.rds has the street addresses of about 900,000 Medicare service providers. Download the file to save it on your own system, then read it in under a convenient name.
download.file(url="http://tiny.cc/dcf/CMS_ProvidersSimple.rds",
              destfile = "YourNameForTheFile.rds")
DataTable <- readRDS("YourNameForTheFile.rds")

To solve such problems, start by looking at a few dozen of the addresses to familiarize yourself with common patterns. With those few dozen:

  1. In everyday language, describe a pattern that you think will identify the information you are looking for.
  2. Translate (1) into the form of a regular expression.
  3. Filter to retain the cases that match the expression. Hint: filter() and grepl() are useful for this.
  4. Filter to retain the cases that do not match the expression.
  5. Examine the results of (3) and (4) to identify shortcomings in your patterns.
  6. Improve or extend the pattern to deal with the mistaken cases.
  7. Repeat until satisfied.
  8. Put extraction parentheses around the parts of the regular expression that contain the info you want.

17.13 Solved Example:

Suppose you wanted to extract the PO Box number from an address.

Read the street address data and pull out a sample of a few dozen cases.

library(tidyverse)  # provides read_csv(), %>%, filter(), and sample_n()
Addresses <- read_csv("http://tiny.cc/dcf/street-addresses.csv")
Sample <- 
  Addresses %>%
  sample_n(size = 50)

Following each of the steps listed above:

  1. The PO Box cases tend to have a substring “PO”.
  2. The regular expression for “PO” is simply "PO".
  3. Find some cases that match:

    Matches <- 
      Sample %>%
      filter(grepl(pattern = "PO", address))
    address
    PO BOX 90081
    PO BOX 127
    PO BOX 27032
    PO BOX 40535
    PO BOX 592
    PO BOX 1444
    … and so on for 30 rows altogether.
  4. Find cases that don’t match:

    Dont <- 
      Sample %>%
      filter( ! grepl(pattern = "PO", address))
    address
    314 SOUTHERN CROSS LANE
    2725 JOYNER ROAD
    258 LYNNBANK ESTATES RD
    NCSU BOX 16010
    2117 OAK HOLLOW DRIVE
    NCSU BOX 15813
    … and so on for 20 rows altogether.
  5. Find any cases in Matches that shouldn’t be there (none so far in the excerpt shown). Find any cases in Dont that should have matched. For example, the “NCSU BOX” addresses ought to be captured among the matches.

  6. It looks like “BOX” is a better pattern. Since the box number is wanted, the regex should include an identifier for the number inside extraction parentheses. So let’s try "BOX\\s+(\\d+)". Note the double backslashes, \\, in \\s and \\d in the pattern. Ordinarily, \ is a special character in R character strings, used to designate special characters like new-line \n or tab \t. The double \\ means, “just an ordinary backslash, please.” This is confusing, but whenever characters are used to signal something special, you have to take an extra step to say that you don’t want the special meaning.

    pattern <- "BOX\\s+(\\d+)"
    
    Matches <- 
      Sample %>% 
      filter(grepl(pattern, address))
    
    Dont <-
      Sample %>%
      filter( ! grepl(pattern, address))
    Dont
    address
    314 SOUTHERN CROSS LANE
    2725 JOYNER ROAD
    258 LYNNBANK ESTATES RD
    2117 OAK HOLLOW DRIVE
    1923 MILBURNIE RD
    8935 SUMMER CLUB ROAD
    … and so on for 10 rows altogether.

The result seems satisfactory.
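
As an aside on the doubled backslashes in the pattern above, the following minimal sketch shows what R actually stores in the string and what the regex engine therefore sees. It uses only base R. The vector name examples is made up for illustration; its three addresses are copied from the excerpts shown earlier.

```r
pattern <- "BOX\\s+(\\d+)"

# In R source code "\\d" is typed with two backslashes,
# but the stored string holds only two characters: \ and d.
nchar("\\d")         # 2

# writeLines() prints the string as the regex engine receives it.
writeLines(pattern)  # BOX\s+(\d+)

# Hypothetical mini-sample, taken from the addresses listed above.
examples <- c("PO BOX 127", "NCSU BOX 16010", "2725 JOYNER ROAD")
is_box <- grepl(pattern, examples)
is_box               # TRUE TRUE FALSE
```

Printing a pattern with writeLines() is a quick way to check that the escaping came out as intended before using it in filter().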

So, use tidyr::extract() to pull out the part of the pattern identified by extraction parentheses. Be sure to review the regular expression pattern in use here to identify which part of the pattern we intend to extract.

BoxNumbers <- 
  Sample %>%
  filter(grepl(pattern, address)) %>%
  tidyr::extract(address, into = "boxnum", regex = pattern)
boxnum
90081
127
27032
16010
40535
592
… and so on for 40 rows altogether.

Note that tidyr::extract() should be given only those cases that match the regular expression, so filter() is applied before tidyr::extract().
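
To see why the filter comes first, here is a minimal sketch on a three-row toy table. The names Toy, NoFilter, and WithFilter are made up for illustration, and the addresses are of the kind shown above. The sketch relies on the documented behavior of tidyr::extract() that a row whose text does not match the regex gets NA in the new column.

```r
library(dplyr)
library(tidyr)

pattern <- "BOX\\s+(\\d+)"

# Toy is a hypothetical three-row table, not the real data.
Toy <- tibble(address = c("PO BOX 127", "2725 JOYNER ROAD", "NCSU BOX 16010"))

# Without filter(): the non-matching row is kept, with NA as its box number.
NoFilter <- Toy %>%
  tidyr::extract(address, into = "boxnum", regex = pattern)

# With filter(): only the matching rows reach extract().
WithFilter <- Toy %>%
  filter(grepl(pattern, address)) %>%
  tidyr::extract(address, into = "boxnum", regex = pattern)

NoFilter
WithFilter
```

The extracted boxnum values are character strings. If numbers are wanted, the convert = TRUE argument of tidyr::extract() is one option.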

17.14 Back to the Streets

Street endings (e.g. “ST”, “LANE”) are often found at the end of the address string. Use this as a starting point to find the most common endings.
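
One possible way to get started, sketched here under assumptions rather than offered as the intended solution, is to pull out the last word of each address and tally it. The table Toy and the column name last_word are hypothetical. The addresses are copied from the excerpts shown earlier, and the regex "([A-Z]+)$" assumes that the ending is a run of capital letters at the very end of the string.

```r
library(dplyr)
library(tidyr)

# Toy is a hypothetical mini-sample built from addresses shown earlier.
Toy <- tibble(address = c("314 SOUTHERN CROSS LANE", "2725 JOYNER ROAD",
                          "258 LYNNBANK ESTATES RD", "2117 OAK HOLLOW DRIVE",
                          "1923 MILBURNIE RD", "PO BOX 127"))

LastWords <- Toy %>%
  filter(!grepl("BOX", address)) %>%           # set box addresses aside
  tidyr::extract(address, into = "last_word",
                 regex = "([A-Z]+)$") %>%      # capture the final word
  count(last_word, sort = TRUE)                # most common endings first

LastWords
```

On the real data the top of such a tally suggests candidate road words, while the long tail will contain entries that are not road words at all and need to be examined by eye.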

Once you have a set of specific street endings, you can use the regex “or” symbol, e.g. "(ST|RD|ROAD)". The parentheses are not incidental. They are there to mark a pattern that you want to extract. In this case, in addition to knowing that there is a ST or RD or ROAD in an address, you want to know which one of those possibilities it is so that you can count the occurrences of each of the possibilities.
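
The following sketch shows the “or” pattern used to extract and count endings on a toy table. The names Toy and EndingCounts are hypothetical, and the last two addresses are invented. As a variation that is an assumption of this sketch, the pattern is anchored with \\s in front and $ behind, because the unanchored "(ST|RD|ROAD)" also finds ST inside words such as STONE.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table; the last two rows are invented for illustration.
Toy <- tibble(address = c("2725 JOYNER ROAD", "258 LYNNBANK ESTATES RD",
                          "1923 MILBURNIE RD", "8935 SUMMER CLUB ROAD",
                          "12 MAIN ST", "8 STONE CIRCLE"))

# Unanchored, "ST" is found inside STONE; anchored, it is not.
loose  <- grepl("(ST|RD|ROAD)", "8 STONE CIRCLE")       # TRUE
strict <- grepl("\\s(ST|RD|ROAD)$", "8 STONE CIRCLE")   # FALSE

pattern <- "\\s(ST|RD|ROAD)$"

EndingCounts <- Toy %>%
  filter(grepl(pattern, address)) %>%                       # keep matches only
  tidyr::extract(address, into = "ending", regex = pattern) %>%
  count(ending)                                             # one row per ending

EndingCounts
```

Whether to anchor the pattern is a design choice: anchoring avoids false matches inside street names but misses addresses where something such as an apartment number follows the road word.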

To find street endings that aren’t in your set, you can filter out the street endings or non-street addresses you already know about.

Your turn: Read the following R statements. Next to each line, give a short explanation of what the line contributes to the task. For each of the regexes, explain in simple everyday language what pattern is being matched.

pattern <- "(ST|RD|ROAD)"
LeftOvers <-
  Addresses %>% 
  filter( ! grepl(pattern, address),
          ! grepl("\\sAPT|UNIT\\s[\\d]+$", address),
          ! grepl(" BOX ", address)
          )
address
2117 MARINER CIRCLE
101 EPPING WAY
04-I ROBIN CIRCLE
NCSU B0X 15637
4719 BROWN TRAIL
130 THE WINERY
… and so on for 2,411 rows altogether.

For each set of patterns that you identify, compute the LeftOvers. Examine them visually to find new street endings to add to the pattern, e.g. LANE.

When you have this working on the small sample, use a larger sample and, eventually, the whole data set. It’s practically impossible to find a method that will work perfectly on new data, but do the best you can.

Your turn: In your report, implement your method and explain how it works, line by line. Present your result: how many addresses are there of each kind of road word?

17.15 For the professional …

Breaking addresses into their components is a common task. People who work on this problem intensively sometimes publish their regular expressions. Here’s one from Ross Hammer published at http://regexlib.com/Search.aspx?k=street

^\s*((?:(?:\d+(?:\x20+\w+\.?)+(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT
|CT|CIRCLE|LANE|LN|BOULEVARD|BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|
(?:General\x20+Delivery)|(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|
FL|HNGR|LOT|PIER|RM|S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|
(?:BSMT|FRNT|LBBY|LOWR|OFC|PH|REAR|SIDE|UPPR))?)\,?\s+((?:(?:\d+(?:\x20+\w+\.?)+
(?:(?:\x20+STREET|ST|DRIVE|DR|AVENUE|AVE|ROAD|RD|LOOP|COURT|CT|CIRCLE|LANE|LN|BOULEVARD|
BLVD)\.?)?)|(?:(?:P\.\x20?O\.|P\x20?O)\x20*Box\x20+\d+)|(?:General\x20+Delivery)|
(?:C[\\\/]O\x20+(?:\w+\x20*)+))\,?\x20*(?:(?:(?:APT|BLDG|DEPT|FL|HNGR|LOT|PIER|RM|
S(?:LIP|PC|T(?:E|OP))|TRLR|UNIT|\x23)\.?\x20*(?:[a-zA-Z0-9\-]+))|(?:BSMT|FRNT|LBBY|
LOWR|OFC|PH|REAR|SIDE|UPPR))?)?\,?\s+((?:[A-Za-z]+\x20*)+)\,\s+(A[LKSZRAP]|C[AOT]|
D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|
S[CD]|T[NX]|UT|V[AIT]|W[AIVY])\s+(\d+(?:-\d+)?)\s*$