This is a breif description of the process required to fetch data from a sample data table in Google’s BigQuery database. The idea of looking here for data came from a blog post which pulled in births by day of year from one of the data tables.
The following packaes are required for the code below.
require(dplyr)
require(mosaic)
require(bigrquery)
require(magrittr)
require(lubridate)
The code querying Google BigQuery data needs to run in an interactive session, so it isn’t runing live while this is being knit. Here’s the session info:
sessionInfo()
## R version 3.1.0 Patched (2014-06-19 r65979)
## Platform: x86_64-apple-darwin13.1.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.3.3 magrittr_1.0.1 bigrquery_0.1 mosaic_0.9-1
## [5] ggplot2_1.0.0 lattice_0.20-29 car_2.0-20 dplyr_0.2.0.99
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 cluster_1.15.2 colorspace_1.2-4
## [4] digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [7] Formula_1.1-1 ggdendro_0.1-14 grid_3.1.0
## [10] gridExtra_0.9.1 gtable_0.1.2 Hmisc_3.14-4
## [13] htmltools_0.2.4 httr_0.3 jsonlite_0.9.8
## [16] knitr_1.6 latticeExtra_0.6-26 MASS_7.3-33
## [19] memoise_0.2.1 munsell_0.4.2 nnet_7.3-8
## [22] parallel_3.1.0 plyr_1.8.1 proto_0.3-10
## [25] RColorBrewer_1.0-5 Rcpp_0.11.2 RCurl_1.95-4.1
## [28] reshape2_1.4 rmarkdown_0.2.47 scales_0.2.4
## [31] splines_3.1.0 stringr_0.6.2 survival_2.37-7
## [34] tools_3.1.0 yaml_2.1.13
Google provides a few large data sets in their sample database.
Name | Summary | # Rows |
---|---|---|
gsod | Samples from US weather stations since 1929 | 115M |
mlab | Measurement data of broadband connection performance | 240B |
natality | Birth information for the United States from 1969 to 2008 | 68M |
shakespeare | Word index for works of Shakespeare | 164K |
wikipedia | Revision information for Wikipedia articles | 314M |
The code books are availble for all of these data sets (as are some example queries). For example, the natality code book is available at https://developers.google.com/bigquery/docs/dataset-natality.
Some notes from the code book:
record_weight
(1 or 2) indicates whether the data come from a region that reported 50% of births (2) or all births (1). This is important if trying to count/estimate the total number of births on a given day.Before you can grab data, you need to sign up for access at https://developers.google.com/bigquery/sign-up and create (at least one) project. When you make your first query, you will be asked to authenticate, but the authentication information can be cached so that you only have to do that step once.
The following code fetches the desired data from Google.
queryString <- paste(
"SELECT year, month, day, state, sum(record_weight) AS births",
"FROM natality", # name of table
"WHERE year < 1989", # no day-level data after this
"group by year, month, day, state")
Birthdays.download <-
query_exec("publicdata", # project
"samples", # database
queryString, # query
billing="mosaic-2014", # project to bill to (this is free)
max_pages=Inf) # don't limit the number of rows
Birthdays <-
Birthdays.download %>%
group_by(year, month, day, state) %>%
summarise(births=sum(births)) %>%
mutate(
date = ymd(paste(year,month,day, sep="/")),
wday = wday(date, label=TRUE, abbr=TRUE)
) %>%
select(state, year, month, day, date, wday, births) %>%
as.data.frame()
There were some warnings (hidden in this output) about dates that could not be converted. Let’s see what happened.
Birthdays %>%
filter(is.na(date)) %>%
mosaic::sample(5)
## state year month day date wday births orig.ids
## 598 DC 1971 4 99 <NA> <NA> 2 598
## 910 GA 1974 1 99 <NA> <NA> 8 910
## 778 IL 1972 10 99 <NA> <NA> 2 778
## 472 IN 1970 9 99 <NA> <NA> 4 472
## 943 CA 1974 6 99 <NA> <NA> 4 943
Those 99th days of the month look suspicious. It isn’t clear exactly what they code, but all of the bad dates come from these day codes that are 99.
Birthdays %>%
filter(is.na(date)) %>%
group_by(year) %>%
summarise( n=n(), min.day=min(day), max.day=max(day))
## Source: local data frame [20 x 4]
##
## year n min.day max.day
## 1 1969 254 99 99
## 2 1970 316 99 99
## 3 1971 128 99 99
## 4 1972 102 99 99
## 5 1973 106 99 99
## 6 1974 93 99 99
## 7 1975 62 99 99
## 8 1976 64 99 99
## 9 1977 29 99 99
## 10 1978 28 99 99
## 11 1979 25 99 99
## 12 1980 23 99 99
## 13 1981 17 99 99
## 14 1982 24 99 99
## 15 1983 31 99 99
## 16 1984 23 99 99
## 17 1985 8 99 99
## 18 1986 12 99 99
## 19 1987 5 99 99
## 20 1988 7 99 99
tally( is.na(date) ~ (day==99), data=Birthdays, format="count")
## (day == 99)
## is.na(date) TRUE FALSE
## TRUE 1357 0
## FALSE 0 372864
Let’s check how the number of bad codes varies by state
Birthdays %>%
group_by(state) %>%
summarise(n=n(), bad=sum(day > 50)) %>%
arrange(bad/n)
## Source: local data frame [51 x 3]
##
## state n bad
## 1 VT 7306 1
## 2 MT 7307 2
## 3 NE 7307 2
## 4 WY 7307 2
## 5 SD 7309 4
## 6 AK 7311 5
## 7 DE 7314 7
## 8 RI 7313 8
## 9 ME 7318 9
## 10 KS 7321 10
## 11 FL 7317 10
## 12 ID 7316 10
## 13 CO 7315 10
## 14 ND 7315 10
## 15 OR 7318 11
## 16 NH 7320 12
## 17 WA 7318 12
## 18 AR 7324 14
## 19 UT 7321 14
## 20 NV 7323 16
## 21 HI 7322 16
## 22 WV 7327 17
## 23 IA 7323 17
## 24 TN 7326 18
## 25 NC 7325 18
## 26 MO 7330 21
## 27 OK 7327 21
## 28 IN 7336 25
## 29 LA 7334 25
## 30 MI 7351 28
## 31 MD 7339 28
## 32 MN 7344 29
## 33 SC 7344 30
## 34 WI 7342 31
## 35 VA 7341 31
## 36 AZ 7347 37
## 37 NM 7346 38
## 38 IL 7353 39
## 39 AL 7351 39
## 40 NJ 7361 40
## 41 MS 7350 40
## 42 NY 7378 45
## 43 OH 7370 51
## 44 MA 7367 52
## 45 DC 7363 52
## 46 KY 7366 53
## 47 TX 7384 54
## 48 CT 7366 54
## 49 GA 7376 62
## 50 PA 7406 76
## 51 CA 7426 101
Let’s filter those 99’s out
Birthdays <-
Birthdays %>% filter(day < 99)
#### Saving in R native format for later use.
save(Birthdays, file="Birthdays.rda")