Statistics students (and instructors) need experience wrestling with large, messy, complex, challenging data sets, for which there is no obvious goal or specially-curated statistical method. In this example, we consider a case study from a subset of the 180 million record Airline Delays dataset (see http://stat-computing.org/dataexpo/2009 and http://www.amherst.edu/~nhorton/precursors) that includes n=336,776 domestic commercial flights originating in New York City area airports (Newark, JFK, and LaGuardia) in 2013. These data are made available through Hadley Wickham’s nycflights13 package on CRAN, which includes five dataframes. See the appendix for a list of the first few observations in each of these tables.

require(mosaic); require(nycflights13)

Students can use this dataset to address questions that they find real and relevant. (It is not hard to find motivation for investigating patterns of flight delays. Ask students: have you ever been stuck in an airport because your flight was delayed or cancelled and wondered if you could have predicted the delay if you’d had more data?) This dataset is attractive because it is more similar to what analysts actually see in the wild than what is typically taught in the introductory statistics classroom.

Flights to San Francisco Bay

We start with an analysis focused on three destination airports in the San Francisco Bay area (OAK, SFO, and SJC).

filter(airports, faa %in% c('SFO', 'OAK', 'SJC'))   
## Source: local data frame [3 x 7]
## 
##   faa                          name  lat  lon alt tz dst
## 1 OAK     Metropolitan Oakland Intl 37.7 -122   9 -8   A
## 2 SFO            San Francisco Intl 37.6 -122  13 -8   A
## 3 SJC Norman Y Mineta San Jose Intl 37.4 -122  62 -8   A

How many flights are there to each airport in January, 2013?

airportcounts <- flights %>% 
   filter(dest %in% c('SFO', 'OAK', 'SJC')) %>%
   group_by(year, month, dest) %>%
   summarise(count = n()) 
filter(airportcounts, month==1)
## Source: local data frame [3 x 4]
## Groups: year, month
## 
##   year month dest count
## 1 2013     1  OAK    20
## 2 2013     1  SFO   889
## 3 2013     1  SJC    20

Almost all are to San Francisco international (SFO). Let’s take a closer look at what carriers service this route.

airlines <- mutate(airlines, name=as.character(name), carrier=as.character(carrier))
sfoflights <- inner_join(filter(flights, dest=="SFO"), airlines)
## Joining by: "carrier"
tally(~ name, margin=TRUE, data=sfoflights)
## 
## American Airlines Inc.   Delta Air Lines Inc.        JetBlue Airways 
##                   1422                   1858                   1035 
##  United Air Lines Inc.         Virgin America                  Total 
##                   6819                   2197                  13331
tally(~ name, format="prop", data=sfoflights)
## 
## American Airlines Inc.   Delta Air Lines Inc.        JetBlue Airways 
##                 0.1067                 0.1394                 0.0776 
##  United Air Lines Inc.         Virgin America 
##                 0.5115                 0.1648

United is the largest carrier (it accounts for more than half of the flights.

Are there different delays by carrier?

favstats(arr_delay ~ name, data=sfoflights)
##                   .group min  Q1 median    Q3  max  mean   sd    n missing
## 1 American Airlines Inc. -63 -21     -4 18.00 1007  9.68 58.7 1398      24
## 2   Delta Air Lines Inc. -69 -28    -13  3.25  561 -5.88 39.9 1848      10
## 3        JetBlue Airways -64 -23     -7 14.00  445  3.58 46.1 1020      15
## 4  United Air Lines Inc. -73 -21     -6 13.00  422  3.14 42.0 6728      91
## 5         Virgin America -86 -26    -12  7.00  676  3.58 60.4 2179      18

The “average” results (median) is that flights arrive a few minutes early. And even the 3rd quartile or the mean are relatively modest delays (all less than 20 minutes). But the maximum delays can be large.

Those missing values are likely cancelled flights. What month did they tend to occur?

tally(~ month, margin=TRUE, data=filter(sfoflights, is.na(arr_delay)))
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
##     4    29    11    11     5    11    39    11     7     7     3    20 
## Total 
##   158

These seem to be most common in July, February, and December.

How should the cancelled flights be handled? Perhaps they could be recoded as 4 hour (240 minute) delays. Let’s see what a graphical description of the data looks like once we recode them.

sfoflights <- mutate(sfoflights, realdelay = ifelse(is.na(arr_delay), 240, arr_delay))
densityplot(~ realdelay, group=name, auto.key=TRUE, xlim=c(-120, 300), data=sfoflights)

Is there seasonality to the number of flights?

We can consider whether the number of flights changes month by month.

library(lubridate)
sfocounts <- filter(airportcounts, dest=="SFO") %>%
  mutate(Date = ymd(paste(year, "-", month, "-01", sep="")))
xyplot(count ~ Date, type=c("p","l"), lwd=2, 
       xlab="Year",
       ylab="Number of flights\nto SFO per month", data=sfocounts)

We observe that there are some interesting patterns over the course of the year for SFO.

Weather

The nycflights13 package in R includes other data scraped from the Internet (e.g. detailed weather information). We can display the temperature (in degrees Fahrenheit) on New Year’s Day, 2013.

head(weather)
## Source: local data frame [6 x 14]
## Groups: month, day, hour
## 
##   origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## 1    EWR 2013     1   1    0 37.0 21.9  54.0      230       10.4      11.9
## 2    EWR 2013     1   1    1 37.0 21.9  54.0      230       13.8      15.9
## 3    EWR 2013     1   1    2 37.9 21.9  52.1      230       12.7      14.6
## 4    EWR 2013     1   1    3 37.9 23.0  54.5      230       13.8      15.9
## 5    EWR 2013     1   1    4 37.9 24.1  57.0      240       15.0      17.2
## 6    EWR 2013     1   1    6 39.0 26.1  59.4      270       10.4      11.9
## Variables not shown: precip (dbl), pressure (dbl), visib (dbl)
xyplot(temp ~ hour, type=c("p", "l"), 
       ylab="Temperature on\nNew Year's Day", data=filter(weather, month==1 & day==1))

Let’s take a look at daily averages for delays as well as total precipation and maximum wind speed.

avgdelay <- flights %>%
  group_by(month, day) %>%
  filter(month < 13) %>%
  summarise(avgdelay = mean(arr_delay, na.rm=TRUE)) 
precip <- weather %>%
  group_by(month, day) %>%
  filter(month < 13) %>%
  summarise(totprecip = sum(precip), maxwind = max(wind_speed))
precip <- mutate(precip, anyprecip = ifelse(totprecip==0, "No", "Yes"))
merged <- left_join(avgdelay, precip, by=c("day", "month"))
head(merged)
## Source: local data frame [6 x 6]
## Groups: month
## 
##   month day avgdelay totprecip maxwind anyprecip
## 1     1   1    12.65         0    16.1        No
## 2     1   2    12.69         0    18.4        No
## 3     1   3     5.73         0    11.5        No
## 4     1   4    -1.93         0    24.2        No
## 5     1   5    -1.53         0    18.4        No
## 6     1   6     4.24         0    15.0        No

A dramatic outlier emerges: windspeeds of 1000 mph are not common!

favstats(~ maxwind, data=merged)
##   min   Q1 median   Q3  max mean   sd   n missing
##  5.75 12.7   16.1 19.6 1048 19.3 54.4 363       2
filter(merged, maxwind > 1000)
## Source: local data frame [1 x 6]
## Groups: month
## 
##   month day avgdelay totprecip maxwind anyprecip
## 1     2  12    -2.14         0    1048        No

Let’s remove this outlier and consider the association between any precipiation and average delays.

merged <- filter(merged, maxwind < 1000)
bwplot(avgdelay ~ anyprecip, main="Association of delay with any precipitation", data=merged)

Precipitation seems to be associated with delays.

xyplot(avgdelay ~ maxwind, type=c("p", "smooth"), data=merged)

Max windspeed also seems to be associated with delays.

xyplot(avgdelay ~ maxwind, groups=anyprecip, auto.key=TRUE, type=c("p", "smooth"), data=merged)

After stratifying by precipitation status, we see that windspeed does not appear to be a major determinant of delays.

Your assignment

Your mission: calculate the average delay by destination airport and origin airport for all flights in July, 2013 (hint, use filter(), group_by(), and summarise()).

destdelay <- flights %>%
  filter(month==7) %>%
  mutate(realdelay = ifelse(is.na(arr_delay), 240, arr_delay)) %>%
  group_by(dest, origin) %>%
  summarise(count = n(), distance = mean(distance), avgdelay = mean(realdelay)) %>% 
  filter(count > 100) %>%
  ungroup()
filter(destdelay, dest=="DCA")
## Source: local data frame [3 x 5]
## 
##   dest origin count distance avgdelay
## 1  DCA    EWR   126      199     42.9
## 2  DCA    JFK   281      213     44.0
## 3  DCA    LGA   406      214     31.4
arrange(destdelay, desc(avgdelay))
## Source: local data frame [88 x 5]
## 
##    dest origin count distance avgdelay
## 1   IAD    LGA   152      229     50.2
## 2   CVG    EWR   240      569     49.1
## 3   RIC    EWR   146      277     46.0
## 4   DCA    JFK   281      213     44.0
## 5   STL    EWR   216      872     43.9
## 6   CLE    LGA   163      419     43.5
## 7   IAD    JFK   238      228     43.3
## 8   JAX    JFK   124      828     43.0
## 9   DCA    EWR   126      199     42.9
## 10  ATL    LGA   847      762     42.5
## ..  ...    ...   ...      ...      ...
xyplot(avgdelay ~ distance, type=c("p", "smooth"), data=destdelay)

xyplot(avgdelay ~ distance, groups=origin, auto.key=TRUE, type=c("p", "smooth"), data=destdelay)

Closing thoughts and further resources

Horton NJ, Baumer BS, and Wickham H. Setting the stage for data science: integration of data management skills in introductory and second courses in statistics, CHANCE, 28(2):40-50, http://chance.amstat.org/2015/04/setting-the-stage.

Wickham H. (2011). ASA 2009 Data Expo, Journal of Computational and Graphical Statistics, 20(2):281-283.

Appendix

Here is more information on the structure and first few rows of each of the datasets.

airlines
## Source: local data frame [16 x 2]
## 
##    carrier                        name
## 1       9E           Endeavor Air Inc.
## 2       AA      American Airlines Inc.
## 3       AS        Alaska Airlines Inc.
## 4       B6             JetBlue Airways
## 5       DL        Delta Air Lines Inc.
## 6       EV    ExpressJet Airlines Inc.
## 7       F9      Frontier Airlines Inc.
## 8       FL AirTran Airways Corporation
## 9       HA      Hawaiian Airlines Inc.
## 10      MQ                   Envoy Air
## 11      OO       SkyWest Airlines Inc.
## 12      UA       United Air Lines Inc.
## 13      US             US Airways Inc.
## 14      VX              Virgin America
## 15      WN      Southwest Airlines Co.
## 16      YV          Mesa Airlines Inc.
airports
## Source: local data frame [1,397 x 7]
## 
##    faa                           name  lat    lon  alt tz dst
## 1  04G              Lansdowne Airport 41.1  -80.6 1044 -5   A
## 2  06A  Moton Field Municipal Airport 32.5  -85.7  264 -5   A
## 3  06C            Schaumburg Regional 42.0  -88.1  801 -6   A
## 4  06N                Randall Airport 41.4  -74.4  523 -5   A
## 5  09J          Jekyll Island Airport 31.1  -81.4   11 -4   A
## 6  0A9 Elizabethton Municipal Airport 36.4  -82.2 1593 -4   A
## 7  0G6        Williams County Airport 41.5  -84.5  730 -5   A
## 8  0G7  Finger Lakes Regional Airport 42.9  -76.8  492 -5   A
## 9  0P2   Shoestring Aviation Airfield 39.8  -76.6 1000 -5   U
## 10 0S9          Jefferson County Intl 48.1 -122.8  108 -8   A
## .. ...                            ...  ...    ...  ... .. ...
planes
## Source: local data frame [3,322 x 9]
## 
##    tailnum year                    type     manufacturer     model engines
## 1   N10156 2004 Fixed wing multi engine          EMBRAER EMB-145XR       2
## 2   N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 3   N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 4   N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 5   N10575 2002 Fixed wing multi engine          EMBRAER EMB-145LR       2
## 6   N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 7   N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 8   N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 9   N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 10  N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## ..     ...  ...                     ...              ...       ...     ...
## Variables not shown: seats (int), speed (int), engine (chr)
flights
## Source: local data frame [336,776 x 16]
## 
##    year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1  2013     1   1      517         2      830        11      UA  N14228
## 2  2013     1   1      533         4      850        20      UA  N24211
## 3  2013     1   1      542         2      923        33      AA  N619AA
## 4  2013     1   1      544        -1     1004       -18      B6  N804JB
## 5  2013     1   1      554        -6      812       -25      DL  N668DN
## 6  2013     1   1      554        -4      740        12      UA  N39463
## 7  2013     1   1      555        -5      913        19      B6  N516JB
## 8  2013     1   1      557        -3      709       -14      EV  N829AS
## 9  2013     1   1      557        -3      838        -8      B6  N593JB
## 10 2013     1   1      558        -2      753         8      AA  N3ALAA
## ..  ...   ... ...      ...       ...      ...       ...     ...     ...
## Variables not shown: flight (int), origin (chr), dest (chr), air_time
##   (dbl), distance (dbl), hour (dbl), minute (dbl)
weather
## Source: local data frame [8,719 x 14]
## Groups: month, day, hour
## 
##    origin year month day hour temp dewp humid wind_dir wind_speed
## 1     EWR 2013     1   1    0 37.0 21.9  54.0      230      10.36
## 2     EWR 2013     1   1    1 37.0 21.9  54.0      230      13.81
## 3     EWR 2013     1   1    2 37.9 21.9  52.1      230      12.66
## 4     EWR 2013     1   1    3 37.9 23.0  54.5      230      13.81
## 5     EWR 2013     1   1    4 37.9 24.1  57.0      240      14.96
## 6     EWR 2013     1   1    6 39.0 26.1  59.4      270      10.36
## 7     EWR 2013     1   1    7 39.0 27.0  61.6      250       8.06
## 8     EWR 2013     1   1    8 39.0 28.0  64.4      240      11.51
## 9     EWR 2013     1   1    9 39.9 28.0  62.2      250      12.66
## 10    EWR 2013     1   1   10 39.0 28.0  64.4      260      12.66
## ..    ...  ...   ... ...  ...  ...  ...   ...      ...        ...
## Variables not shown: wind_gust (dbl), precip (dbl), pressure (dbl), visib
##   (dbl)