This is a breif description of the process required to fetch data from a sample data table in Google’s BigQuery database. The idea of looking here for data came from a blog post which pulled in births by day of year from one of the data tables.

R Notes

The following packaes are required for the code below.

require(dplyr)
require(mosaic)
require(bigrquery)
require(magrittr)
require(lubridate)

The code querying Google BigQuery data needs to run in an interactive session, so it isn’t runing live while this is being knit. Here’s the session info:

sessionInfo()
## R version 3.1.0 Patched (2014-06-19 r65979)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] lubridate_1.3.3 magrittr_1.0.1  bigrquery_0.1   mosaic_0.9-1   
## [5] ggplot2_1.0.0   lattice_0.20-29 car_2.0-20      dplyr_0.2.0.99 
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1      cluster_1.15.2      colorspace_1.2-4   
##  [4] digest_0.6.4        evaluate_0.5.5      formatR_0.10       
##  [7] Formula_1.1-1       ggdendro_0.1-14     grid_3.1.0         
## [10] gridExtra_0.9.1     gtable_0.1.2        Hmisc_3.14-4       
## [13] htmltools_0.2.4     httr_0.3            jsonlite_0.9.8     
## [16] knitr_1.6           latticeExtra_0.6-26 MASS_7.3-33        
## [19] memoise_0.2.1       munsell_0.4.2       nnet_7.3-8         
## [22] parallel_3.1.0      plyr_1.8.1          proto_0.3-10       
## [25] RColorBrewer_1.0-5  Rcpp_0.11.2         RCurl_1.95-4.1     
## [28] reshape2_1.4        rmarkdown_0.2.47    scales_0.2.4       
## [31] splines_3.1.0       stringr_0.6.2       survival_2.37-7    
## [34] tools_3.1.0         yaml_2.1.13

Public data available at Google BigQuery

Google provides a few large data sets in their sample database.

Name Summary # Rows
gsod Samples from US weather stations since 1929 115M
mlab Measurement data of broadband connection performance 240B
natality Birth information for the United States from 1969 to 2008 68M
shakespeare Word index for works of Shakespeare 164K
wikipedia Revision information for Wikipedia articles 314M

Code book for Natality data

The code books are availble for all of these data sets (as are some example queries). For example, the natality code book is available at https://developers.google.com/bigquery/docs/dataset-natality.

Some notes from the code book:

Grabbing a Birthdays data set

Signing up for acces

Before you can grab data, you need to sign up for access at https://developers.google.com/bigquery/sign-up and create (at least one) project. When you make your first query, you will be asked to authenticate, but the authentication information can be cached so that you only have to do that step once.

Fetching via an SQL query

The following code fetches the desired data from Google.

queryString <- paste(
  "SELECT year, month, day, state, sum(record_weight) AS births",
  "FROM natality",        # name of table
  "WHERE year < 1989",    # no day-level data after this
  "group by year, month, day, state")

Birthdays.download <-
  query_exec("publicdata",     # project
             "samples",        # database
             queryString,      # query
             billing="mosaic-2014",    # project to bill to (this is free)
             max_pages=Inf)   # don't limit the number of rows

Post-processing with dplyr and lubridate

Birthdays <- 
  Birthdays.download %>%
    group_by(year, month, day, state) %>% 
    summarise(births=sum(births)) %>%
    mutate(
      date = ymd(paste(year,month,day, sep="/")),
      wday = wday(date, label=TRUE, abbr=TRUE)
      ) %>%
    select(state, year, month, day, date, wday, births) %>%
  as.data.frame()

Sanity checks

There were some warnings (hidden in this output) about dates that could not be converted. Let’s see what happened.

Birthdays %>% 
  filter(is.na(date)) %>% 
  mosaic::sample(5)
##     state year month day date wday births orig.ids
## 598    DC 1971     4  99 <NA> <NA>      2      598
## 910    GA 1974     1  99 <NA> <NA>      8      910
## 778    IL 1972    10  99 <NA> <NA>      2      778
## 472    IN 1970     9  99 <NA> <NA>      4      472
## 943    CA 1974     6  99 <NA> <NA>      4      943

Those 99th days of the month look suspicious. It isn’t clear exactly what they code, but all of the bad dates come from these day codes that are 99.

Birthdays %>% 
  filter(is.na(date)) %>%
  group_by(year) %>%
  summarise( n=n(), min.day=min(day), max.day=max(day))
## Source: local data frame [20 x 4]
## 
##    year   n min.day max.day
## 1  1969 254      99      99
## 2  1970 316      99      99
## 3  1971 128      99      99
## 4  1972 102      99      99
## 5  1973 106      99      99
## 6  1974  93      99      99
## 7  1975  62      99      99
## 8  1976  64      99      99
## 9  1977  29      99      99
## 10 1978  28      99      99
## 11 1979  25      99      99
## 12 1980  23      99      99
## 13 1981  17      99      99
## 14 1982  24      99      99
## 15 1983  31      99      99
## 16 1984  23      99      99
## 17 1985   8      99      99
## 18 1986  12      99      99
## 19 1987   5      99      99
## 20 1988   7      99      99
tally( is.na(date) ~ (day==99), data=Birthdays, format="count") 
##            (day == 99)
## is.na(date)   TRUE  FALSE
##       TRUE    1357      0
##       FALSE      0 372864

Let’s check how the number of bad codes varies by state

Birthdays %>% 
  group_by(state) %>%
  summarise(n=n(), bad=sum(day > 50)) %>%
  arrange(bad/n)
## Source: local data frame [51 x 3]
## 
##    state    n bad
## 1     VT 7306   1
## 2     MT 7307   2
## 3     NE 7307   2
## 4     WY 7307   2
## 5     SD 7309   4
## 6     AK 7311   5
## 7     DE 7314   7
## 8     RI 7313   8
## 9     ME 7318   9
## 10    KS 7321  10
## 11    FL 7317  10
## 12    ID 7316  10
## 13    CO 7315  10
## 14    ND 7315  10
## 15    OR 7318  11
## 16    NH 7320  12
## 17    WA 7318  12
## 18    AR 7324  14
## 19    UT 7321  14
## 20    NV 7323  16
## 21    HI 7322  16
## 22    WV 7327  17
## 23    IA 7323  17
## 24    TN 7326  18
## 25    NC 7325  18
## 26    MO 7330  21
## 27    OK 7327  21
## 28    IN 7336  25
## 29    LA 7334  25
## 30    MI 7351  28
## 31    MD 7339  28
## 32    MN 7344  29
## 33    SC 7344  30
## 34    WI 7342  31
## 35    VA 7341  31
## 36    AZ 7347  37
## 37    NM 7346  38
## 38    IL 7353  39
## 39    AL 7351  39
## 40    NJ 7361  40
## 41    MS 7350  40
## 42    NY 7378  45
## 43    OH 7370  51
## 44    MA 7367  52
## 45    DC 7363  52
## 46    KY 7366  53
## 47    TX 7384  54
## 48    CT 7366  54
## 49    GA 7376  62
## 50    PA 7406  76
## 51    CA 7426 101

Let’s filter those 99’s out

Birthdays <-
  Birthdays %>% filter(day < 99)
#### Saving in R native format for later use.
save(Birthdays, file="Birthdays.rda")