Background

High costs for medical care have been a national concern for more than a decade. There is a wealth of data about medical spending; insurance companies and the government pay most medical costs and keep extensive and detailed records. Nonetheless, until recently there has been little data available to journalists and other citizens. The insurance and government data systems are complex and few other organizations have had the resources to study the data. That has changed in the last few years with “open data” programs in government.

In May 2013, the Centers for Medicare and Medicaid Services (CMS) released data on the price Medicare fee-for-service health services at hospitals around the US in year 2011.

This attracted press attention. For example, a headline in the New York Times read “Hospital Billing Varies Wildly, Government Data Shows”.

NYT Map. Is this showing up with a caption?

NYT Map. Is this showing up with a caption?

A map with the article shows the geographic disparities in hospital charges.

Studying the Medicare Data

The national news reports emphasize geographic disparities in the billing for procedures. A citizen-statistician might be interested to dig deeper into these disparities, perhaps with a local focus. Or, she might want to investigate what other factors are associated with varying prices, for instance:

  1. The type of procedure.
  2. The number of patients receiving the procedure at a given hospital.
  3. Age, wealth, and other demographic characteristics of the area.

Accessing the Data

CMS makes the data available in four formats at this this download page with similar pages for years 2011 and 2012. The four formats are

  1. Inpatient charge data, FY 2013, Microsoft Excel
  2. Inpatient charge data, FY 2013, Comma Separated Values
  3. National and State Summaries, Microsoft Excel
  4. National and State Summaries, Comma Separated Values

As you can see, there are two kinds of data: “charge data” and “national and state summaries.” There are also two kinds of data table file formats: Excel and CSV. Usually, as here, the two file formats contain the same data; you only need to use one of them. It’s generally easier to use the CSV format.

A codebook in PDF format is provided under a link labelled methodology. This CMS documentation doesn’t specify in a concise way what a case means. After you load the data into R, you can figure this out.

For convenience the 2011 inpatient data are available in the DataComputing package in the form of three data tables: MedicareCharges, MedicareProviders, and DirectRecoveryGroups. This case study will work with those tables.

Checking the Data

The New York Times says:

The data for 3,300 hospitals, released by the federal Centers for Medicare and Medicaid Services, shows wide variations not only regionally but among hospitals in the same area or city.

A quick summary of the data using commands that students might encounter in an introductory statistics course:

nrow(MedicareCharges)
## [1] 163065
nrow(MedicareProviders)
## [1] 3337

The second number corresponds well with the New York Times description. But why are there 163065 rows if there are only 3337 hospitals? Because there is a separate listing for each of 100 procedures (known as a Direct Recovery Group (DRG)).

nrow(DirectRecoveryGroups)
## [1] 100

If every provider had a listing for every DRG, there would be 3337 times 100 cases in MedicareCharges. There’s only about half that many; a typical provider is listed roughly 50 times.

Loading the data directly from the source

It can be instructive to follow the process of loading the data directly from the Centers for Medicare and Medicaid Studies into R. Again, you don’t need to do this unless you want to experience the process or apply the pattern to a new source of data, for instance the 2013 or other more recent data.

Using a browser, download the file at the link to the “inpatient data” in CSV format. As you’ll see, the result is a file named Inpatient_Data_2013_CSV.zip. This is not actually a .csv file; it is a .zip compressed file. A .zip file is analogous to a suitcase: it can hold several different items in a compressed format suitable for easy transport.1 You “unzip” such a file to gain access to the uncompressed contents.

The Inpatient_Data_2013_CSV.zip contains just one item. Opening the suitcase (that is, “unzipping” it) extracts that item: Medicare_Provider_Charge_Inpatient_DRG100_FY2013.csv.2

Similarly, you can uncompress the “summary” zip file. This will reveal two files:

  1. Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRG_FY2013.csv
  2. Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRGState_FY2013.csv

It turns out that you won’t need the summary files, but look at them anyways.

It’s informative to look at the size of the .csv files. This helps you to evaluate whether or not you will have problems handling the files.

A file of 26.9 MB is easily handled by today’s personal computers. Note that the summary files are much smaller. That makes sense for a summary! The DRG/State summary file is about 50 times larger than the DRG summary file. Why do you think? What is it about states in the US that suggests a number like 50?

You can load the .csv data into R with commands like these:

Inpatients <- readr::read_csv(file.choose())
# Navigate to the CSV file
# And similarly for the two summary files
DRGState <- readr::fread(file.choose())
DRG <- readr::fread(file.choose())

Once you have read in these three data tables, look at their sizes in the Environment tab in RStudio. It’s common for the table stored in R to be much smaller than the CSV file it was created from, in this case 13.2 MB versus 26.9 MB for the inpatient data.

EXERCISE: Pack your suitcase neatly

Download the .zip files for both year 2011 and 2013. The contents of the files are arranged differently in the two years. (This is not good!) Explain how the contents (that is, the files, their names, and their location) are different.

Sanity Checks

It’s useful to compare the data to your expectations for what values should be. This can help reveal your misconceptions about what the data are. So ask some easy questions for which you already have an approximate answer.

How many procedures?

There are about 300 million people in the US. It’s reasonable to expect that millions of inpatient procedures are performed, only a fraction of which are covered by Medicare or Medicaid. The totalDischarges variable gives the number of people in each DRG for each hospital. Add these up to get the total number of procedures listed in the data.

R provides several different “dialects” for such calculations. In “base R” you might write

with(data = MedicareCharges, sum(totalDischarges))
## [1] 6975318

With the mosaic package you can write this more concisely (and do more complex calculations such as breaking down the total by some other variable):

library(mosaic)
sum( ~ totalDischarges, data = MedicareCharges)
## [1] 6975318

These notes, as in the Data Computing Book, use the dialect provided by the dplyr package. For very simple tasks such as this, dplyr will seem more verbose, but dplyr also provides more power for complicated calculations.

MedicareCharges %>%
  summarize(total = sum(totalDischarges))
##     total
## 1 6975318

Of course, all three dialects provide the same answer: about 7 million in-patient procedures are contained in the MedicareCharges data.

How much is Spent?

How much did Medicare pay out for these services? We expect that it will be many billions of dollars.

MedicareCharges %>%
  summarise(paid = sum(totalDischarges * avePayments),
            charged = sum(totalDischarges * aveCharges))
##          paid      charged
## 1 66689523870 245977912911

$246 billion was charged. “Only” $67 billion was paid.

Variation by provider

Do different providers charge and get paid differently? To answer this question, it would be helpful to have a data table that looks like this:

Provider | Average payment
---------|----------------- 
10001    | 24000
10005    | 29000
         | and so on.

The numbers here are made up, but the structure of the data table is not. Each row is one case: a provider. Each column is one variable: the provider ID and the average charge.

Data like this are called “glyph-ready.” Once you have data in this form, it can be very easy to make whatever plot or other display you want.

Wrangling is the process by which data are manipulated (or “transformed” or “processed”) from their original form into a glyph-ready form. Here’s the original form of the MedicareCharges table:

head(MedicareCharges)
##   drg idProvider totalDischarges aveCharges avePayments
## 1 039      10001              91   32963.08    5777.242
## 2 057      10001              38   20312.79    4894.763
## 3 064      10001              84   38820.39   10260.214
## 4 065      10001             169   27345.10    6542.089
## 5 066      10001              33   17605.52    4596.394
## 6 069      10001              37   20688.84    4134.108

It’s important to be aware that the meaning of a case in the glyph-ready data is different from that in the MedicareCharges data.

EXERCISE: What is a case in the glyph-ready data? What is a case in the MedicareCharges data?

Just to show what a wrangling statement looks like, here’s one for producing the glyph-ready data for payments to providers:

ProviderPayments <-
  MedicareCharges %>%
  group_by(idProvider) %>%
  summarise(avePayment = mean(avePayments))

The glpyh-ready table looks like this:

ProviderPayments
## Source: local data frame [3,337 x 2]
## 
##    idProvider avePayment
## 1       10001   8749.030
## 2       10005   6812.136
## 3       10006   8197.242
## 4       10007   4860.834
## 5       10008   5898.142
## 6       10009   5224.723
## 7       10010   6077.341
## 8       10011   8181.652
## 9       10012   6038.123
## 10      10015   4763.561
## ..        ...        ...

Displays of the variation in payments can be made in several ways. For instance:

densityplot( ~ avePayment, data = ProviderPayments)

sd( ~ avePayment, data = ProviderPayments)
## [1] 3179.064
favstats( avePayment, data = ProviderPayments)
##       min       Q1   median       Q3      max     mean       sd    n
##  3291.319 7028.455 8604.432 10532.16 50552.61 9177.936 3179.064 3337
##  missing
##        0

There’s a lot of variation!

Perhaps that might be due to different providers offering different procedures? To examine the variation from one DRG to another, a glyph-ready table like this would be appropriate:

DRG | average payment
----|-----------------
039 | 33000
057 | 20000
    | and so on

The wrangling statement for producing this glyph-ready data table is very similar to that used to create the provider-by-provider payments. (Try figuring it out yourself.)

Here’s a display of the result:

Variations by State?

Now consider the variation in average payments and charges by state. There are many ways to display variation: histograms, density plots, etc. But before any of these can be constructed, you’ll need to have a data table that looks more or less like this:

State | average payment | average charge ——|——————|————- AK | 5000 | 15000 AL | 6000 | 12000 | and so on. |

Here is one possible display of the information:

## Joining by: "idProvider"

EXERCISE: Graphics choices

Describe some of the choices made in this display. Comment on whether they are effective or not for addressing these questions: How much variation is there in charges? In payments? Do states that charge a lot get paid a lot? How much larger are charges than payments? How does a particular state compare to the rest?

Another possible display shows the variation with DRG within each state while allowing the states themselves to be easily compared.

## Joining by: "idProvider"

Range of Costs by DRG

ggplot(DRGStateCharges, aes(x=drg, y=amount )) +
  geom_boxplot( ) + 
    scale_y_log10(breaks = 1000*c(1,5,10,15,20,30,40,80,160,320)) + 
  xlab("DRG") + 
  ylab("USD per procedure") +
  theme(axis.text.x = element_text(angle = 60)) +
  facet_grid( direction ~ . )

Comparing the 3 highest and 3 lowest cost states and some other big ones

DRG 872 (for example)

The DRGs vary in the same way within a state. Let’s pick one DRG to investigate: 872: SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC

DRG872 <-
  MedicareCharges %>%
  filter(drg == 872)
DRG872 <-
  MedicareProviders %>%
  select(idProvider, stateProvider, zipProvider) %>%
  left_join(DRG872) 
## Joining by: "idProvider"

Demographics

Although the CMS data has no demographic information, other sources do. The DataComputing package provides ZipDemography and ZipGeography with information on population, age distribution, income, disability, etc.

## Joining by: "ZIP"

Looking at Charged versus demography

ggplot( DRG872, aes(x=Over65/Pop, y=aveCharges)) +
  geom_point(alpha = 0.5) + scale_y_log10() 
## Warning: Removed 1172 rows containing missing values (geom_point).

ggplot( DRG872, aes(x=Income, y=aveCharges)) +
  geom_point(alpha = 0.5) + scale_y_log10() 
## Warning: Removed 1171 rows containing missing values (geom_point).

ggplot( DRG872, aes(x=LandArea, y=aveCharges)) +
  geom_point(alpha = 0.5) + scale_y_log10() + scale_x_log10()
## Warning: Removed 1001 rows containing missing values (geom_point).

ggplot( DRG872, aes(x=Pop/LandArea, y=aveCharges)) +
  geom_point(alpha = 0.5) + scale_y_log10() + scale_x_log10()
## Warning: Removed 1171 rows containing missing values (geom_point).

The last two plots suggest that there are two clusters of providers based on demographic variables. If desired, simple machine-learning techniques could be used to define the clusters.

The New York Times map

DRG872 <- 
  DRG872 %>% 
  transform(Rgroup=ntiles(aveCharges,4))

ggplot(DRG872 %>% filter( !stateProvider %in% c("AK","HI")), aes(x=Longitude,y=Latitude,color=Rgroup)) + 
  geom_point(alpha=.6) + 
  geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="white", fill=NA )
## Warning: Removed 6 rows containing missing values (geom_point).

Or, if you prefer,

## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=US&zoom=4&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=US&sensor=false
## Warning: Removed 33 rows containing missing values (geom_point).

Doing things for all the DRGs

One way to compare institutions across DRGs is to consider the residual of each charge to the average for that DRG.

Charges <- MedicareProviders %>%
  select(idProvider, stateProvider, zipProvider) %>%
  left_join(MedicareCharges) %>% 
  inner_join( Zips, by = c(zipProvider = "ZIP"))
## Joining by: "idProvider"
mod <- lm(log10(aveCharges) ~ drg, data = Charges)
Charges$resid <- resid(mod)
Charges <- 
  Charges %>%
  mutate(rgroup = ntiles(resid, 9))
ggplot( Charges, aes(x=Pop/LandArea, y=resid)) +
  geom_point(alpha=.02) + 
  scale_x_log10()
## Warning: Removed 25478 rows containing missing values (geom_point).

Or, looking at average residuals (compared to entire nation) of each provider:

Or, more locally

0.4 on a log10 scale is a factor of 2.5 in charges.

Models

When examining the relationships among multiple variables, linear models can be a good place to start. Note that glyph-ready data is also model-ready.

Something is wrong in the join with Zips.

## Joining by: "idProvider"
mod2 <- lm(mresid ~ Income + Pop, data = ResidsWithDemographics)
rsquared(mod2)
## [1] 0.1167188
anova(mod2)
## Analysis of Variance Table
## 
## Response: mresid
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## Income       1   6.611  6.6112  174.10 < 2.2e-16 ***
## Pop          1   8.091  8.0911  213.07 < 2.2e-16 ***
## Residuals 2930 111.261  0.0380                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod3 <- lm(mresid ~ 
             pop_density + Income + college_educated + 
             stateProvider + Latitude, 
           data=ResidsWithDemographics )
rsquared(mod3)
## [1] 0.4061074
anova(mod3)
## Analysis of Variance Table
## 
## Response: mresid
##                    Df Sum Sq Mean Sq  F value    Pr(>F)    
## pop_density         1  4.339  4.3394 167.3250 < 2.2e-16 ***
## Income              1  6.527  6.5270 251.6751 < 2.2e-16 ***
## college_educated    1  0.291  0.2909  11.2159 0.0008214 ***
## stateProvider      43 39.823  0.9261  35.7102 < 2.2e-16 ***
## Latitude            1  0.164  0.1645   6.3428 0.0118397 *  
## Residuals        2884 74.794  0.0259                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  1. If “suitcase” is too informal for you, an alternative word is “portmanteau.”

  2. It would have been nicer if the single item in the suitcase had a name related to the suitcase itself, e.g. Inpatient_Data_2013.csv. Then you would know exactly what to expect.