High costs for medical care have been a national concern for more than a decade. There is a wealth of data about medical spending; insurance companies and the government pay most medical costs and keep extensive and detailed records. Nonetheless, until recently there has been little data available to journalists and other citizens. The insurance and government data systems are complex and few other organizations have had the resources to study the data. That has changed in the last few years with “open data” programs in government.
In May 2013, the Centers for Medicare and Medicaid Services (CMS) released data on the prices charged for Medicare fee-for-service health services at hospitals around the US in 2011.
This attracted press attention. For example, a headline in the New York Times read “Hospital Billing Varies Wildly, Government Data Shows”.
A map with the article shows the geographic disparities in hospital charges.
The national news reports emphasize geographic disparities in the billing for procedures. A citizen-statistician might want to dig deeper into these disparities, perhaps with a local focus, or investigate what other factors are associated with varying prices.
CMS makes the data available in four formats at this download page, with similar pages for years 2011 and 2012. The four formats are
As you can see, there are two kinds of data: “charge data” and “national and state summaries.” There are also two kinds of data table file formats: Excel and CSV. Usually, as here, the two file formats contain the same data; you only need to use one of them. It’s generally easier to use the CSV format.
A codebook in PDF format is provided under a link labelled “Methodology.” This CMS documentation doesn’t concisely specify what a case in the data represents. After you load the data into R, you can figure this out yourself.
For convenience, the 2011 inpatient data are available in the DataComputing package in the form of three data tables: MedicareCharges, MedicareProviders, and DirectRecoveryGroups. This case study will work with those tables.
The New York Times says:
The data for 3,300 hospitals, released by the federal Centers for Medicare and Medicaid Services, shows wide variations not only regionally but among hospitals in the same area or city.
A quick summary of the data using commands that students might encounter in an introductory statistics course:
nrow(MedicareCharges)
## [1] 163065
nrow(MedicareProviders)
## [1] 3337
The second number corresponds well with the New York Times description. But why are there 163065 rows if there are only 3337 hospitals? Because there is a separate listing for each of 100 procedure categories, each identified by its DRG (Diagnosis Related Group) code.
nrow(DirectRecoveryGroups)
## [1] 100
If every provider had a listing for every DRG, there would be 3337 times 100 = 333,700 cases in MedicareCharges. There are only about half that many; a typical provider is listed roughly 50 times.
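The arithmetic behind those counts can be checked directly:

```r
# Counts reported above for the 2011 inpatient data
n_charges   <- 163065   # rows in MedicareCharges
n_providers <- 3337     # rows in MedicareProviders
n_drgs      <- 100      # rows in DirectRecoveryGroups

n_charges / (n_providers * n_drgs)   # fraction of possible provider-DRG pairs: about 0.49
n_charges / n_providers              # listings per provider: about 49
```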
It can be instructive to follow the process of loading the data directly from the Centers for Medicare and Medicaid Services into R. Again, you don’t need to do this unless you want to experience the process or apply the pattern to a new source of data, for instance the 2013 or other more recent data.
Using a browser, download the file at the link to the “inpatient data” in CSV format. As you’ll see, the result is a file named Inpatient_Data_2013_CSV.zip. This is not actually a .csv file; it is a .zip compressed file. A .zip file is analogous to a suitcase: it can hold several different items in a compressed format suitable for easy transport. You “unzip” such a file to gain access to the uncompressed contents.
The Inpatient_Data_2013_CSV.zip file contains just one item. Opening the suitcase (that is, “unzipping” it) extracts that item: Medicare_Provider_Charge_Inpatient_DRG100_FY2013.csv.
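You can also do the unzipping from within R using base R’s unzip() function. This sketch guards with file.exists() so it only acts once you have actually downloaded the file:

```r
zipfile <- "Inpatient_Data_2013_CSV.zip"   # the downloaded file
if (file.exists(zipfile)) {
  print(unzip(zipfile, list = TRUE))   # peek at the contents without extracting
  unzip(zipfile)                       # extract into the working directory
}
```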
Similarly, you can uncompress the “summary” zip file. This will reveal two files:
Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRG_FY2013.csv
Medicare_Charge_Inpatient_DRG100_DRG_Summary_by_DRGState_FY2013.csv
It turns out that you won’t need the summary files, but look at them anyway.
It’s informative to look at the sizes of the .csv files. This helps you to evaluate whether you will have problems handling the files.
A file of 26.9 MB is easily handled by today’s personal computers. Note that the summary files are much smaller. That makes sense for a summary! The DRG/State summary file is about 50 times larger than the DRG summary file. Why do you think that is? What is it about states in the US that suggests a number like 50?
You can load the .csv data into R with commands like these:
Inpatients <- readr::read_csv(file.choose())
# Navigate to the CSV file
# And similarly for the two summary files
DRGState <- readr::read_csv(file.choose())
DRG <- readr::read_csv(file.choose())
Once you have read in these three data tables, look at their sizes in the Environment tab in RStudio. It’s common for the table stored in R to be much smaller than the CSV file it was created from, in this case 13.2 MB versus 26.9 MB for the inpatient data.
Download the .zip files for both years 2011 and 2013. The contents of the files are arranged differently in the two years. (This is not good!) Explain how the contents (that is, the files, their names, and their locations) are different.
It’s useful to compare the data to your expectations for what values should be. This can help reveal your misconceptions about what the data are. So ask some easy questions for which you already have an approximate answer.
There are about 300 million people in the US. It’s reasonable to expect that millions of inpatient procedures are performed, only a fraction of which are covered by Medicare or Medicaid. The totalDischarges variable gives the number of people in each DRG for each hospital. Add these up to get the total number of procedures listed in the data.
R provides several different “dialects” for such calculations. In “base R” you might write
with(data = MedicareCharges, sum(totalDischarges))
## [1] 6975318
With the mosaic package you can write this more concisely (and do more complex calculations, such as breaking down the total by some other variable):
library(mosaic)
sum( ~ totalDischarges, data = MedicareCharges)
## [1] 6975318
These notes, as in the Data Computing book, use the dialect provided by the dplyr package. For very simple tasks such as this, dplyr will seem more verbose, but it also provides more power for complicated calculations.
MedicareCharges %>%
summarize(total = sum(totalDischarges))
## total
## 1 6975318
Of course, all three dialects provide the same answer: about 7 million inpatient procedures are contained in the MedicareCharges data.
How much did Medicare pay out for these services? We expect that it will be many billions of dollars.
MedicareCharges %>%
summarise(paid = sum(totalDischarges * avePayments),
charged = sum(totalDischarges * aveCharges))
## paid charged
## 1 66689523870 245977912911
$246 billion was charged. “Only” $67 billion was paid.
Do different providers charge and get paid differently? To answer this question, it would be helpful to have a data table that looks like this:
Provider | Average payment
---------|-----------------
10001 | 24000
10005 | 29000
and so on.
The numbers here are made up, but the structure of the data table is not. Each row is one case: a provider. Each column is one variable: the provider ID and the average payment.
Data like this are called “glyph-ready.” Once you have data in this form, it can be very easy to make whatever plot or other display you want.
Wrangling is the process by which data are manipulated (or “transformed” or “processed”) from their original form into a glyph-ready form. Here’s the original form of the MedicareCharges table:
head(MedicareCharges)
## drg idProvider totalDischarges aveCharges avePayments
## 1 039 10001 91 32963.08 5777.242
## 2 057 10001 38 20312.79 4894.763
## 3 064 10001 84 38820.39 10260.214
## 4 065 10001 169 27345.10 6542.089
## 5 066 10001 33 17605.52 4596.394
## 6 069 10001 37 20688.84 4134.108
It’s important to be aware that the meaning of a case in the glyph-ready data is different from that in the MedicareCharges data.
EXERCISE: What is a case in the glyph-ready data? What is a case in the MedicareCharges data?
Just to show what a wrangling statement looks like, here’s one for producing the glyph-ready data for payments to providers:
ProviderPayments <-
MedicareCharges %>%
group_by(idProvider) %>%
summarise(avePayment = mean(avePayments))
The glyph-ready table looks like this:
ProviderPayments
## # A tibble: 3,337 x 2
## idProvider avePayment
## <int> <dbl>
## 1 10001 8749.030
## 2 10005 6812.136
## 3 10006 8197.242
## 4 10007 4860.834
## 5 10008 5898.142
## 6 10009 5224.723
## 7 10010 6077.341
## 8 10011 8181.652
## 9 10012 6038.123
## 10 10015 4763.561
## # ... with 3,327 more rows
Displays of the variation in payments can be made in several ways. For instance:
densityplot( ~ avePayment, data = ProviderPayments)
sd( ~ avePayment, data = ProviderPayments)
## [1] 3179.064
favstats( ~ avePayment, data = ProviderPayments)
## min Q1 median Q3 max mean sd n
## 3291.319 7028.455 8604.432 10532.16 50552.61 9177.936 3179.064 3337
## missing
## 0
There’s a lot of variation!
Perhaps the variation is due to different providers offering different mixes of procedures? To examine the variation from one DRG to another, a glyph-ready table like this would be appropriate:
DRG | average payment
----|-----------------
039 | 33000
057 | 20000
and so on
The wrangling statement for producing this glyph-ready data table is very similar to that used to create the provider-by-provider payments. (Try figuring it out yourself.)
Here’s a display of the result:
Now consider the variation in average payments and charges by state. There are many ways to display variation: histograms, density plots, etc. But before any of these can be constructed, you’ll need to have a data table that looks more or less like this:
State | average payment | average charge
------|-----------------|---------------
AK    | 5000            | 15000
AL    | 6000            | 12000

and so on.
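One way to build such a table is a dplyr statement that joins the charges to the providers and then groups by state. Here is a sketch; the output names avePayment and aveCharge are our own choices, each state’s averages are weighted by discharge counts, and tiny made-up stand-in tables are used so the pattern stands alone. With the real data, start from MedicareCharges and MedicareProviders directly.

```r
library(dplyr)

# Tiny made-up stand-ins for MedicareCharges and MedicareProviders
Charges <- data.frame(
  idProvider      = c(10001, 10001, 10005),
  totalDischarges = c(10, 30, 5),
  aveCharges      = c(100, 200, 50),
  avePayments     = c(40, 80, 20))
Providers <- data.frame(
  idProvider    = c(10001, 10005),
  stateProvider = c("AL", "AK"))

# Join, then compute discharge-weighted averages state by state
StateCharges <-
  Charges %>%
  left_join(Providers, by = "idProvider") %>%
  group_by(stateProvider) %>%
  summarise(
    avePayment = sum(totalDischarges * avePayments) / sum(totalDischarges),
    aveCharge  = sum(totalDischarges * aveCharges)  / sum(totalDischarges))
```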
Here is one possible display of the information:
EXERCISE: Graphics choices
Describe some of the choices made in this display. Comment on whether they are effective or not for addressing these questions: How much variation is there in charges? In payments? Do states that charge a lot get paid a lot? How much larger are charges than payments? How does a particular state compare to the rest?
Another possible display shows the variation with DRG within each state while allowing the states themselves to be easily compared.
ggplot(DRGStateCharges, aes(x=drg, y=amount )) +
geom_boxplot( ) +
scale_y_log10(breaks = 1000*c(1,5,10,15,20,30,40,80,160,320)) +
xlab("DRG") +
ylab("USD per procedure") +
theme(axis.text.x = element_text(angle = 60)) +
facet_grid( direction ~ . )
The DRGs vary in much the same way within each state. Let’s pick one DRG to investigate: 872, SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC.
DRG872 <-
MedicareCharges %>%
filter(drg == 872)
DRG872 <-
MedicareProviders %>%
select(idProvider, stateProvider, zipProvider) %>%
left_join(DRG872)
## Joining, by = "idProvider"
Although the CMS data has no demographic information, other sources do. The DataComputing package provides ZipDemography and ZipGeography with information on population, age distribution, income, disability, etc.
ggplot( DRG872, aes(x=Over65/Pop, y=aveCharges)) +
geom_point(alpha = 0.5) + scale_y_log10()
## Warning: Removed 1172 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=Income, y=aveCharges)) +
geom_point(alpha = 0.5) + scale_y_log10()
## Warning: Removed 1171 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=LandArea, y=aveCharges)) +
geom_point(alpha = 0.5) + scale_y_log10() + scale_x_log10()
## Warning: Removed 1001 rows containing missing values (geom_point).
ggplot( DRG872, aes(x=Pop/LandArea, y=aveCharges)) +
geom_point(alpha = 0.5) + scale_y_log10() + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1171 rows containing missing values (geom_point).
The last two plots suggest that there are two clusters of providers based on demographic variables. If desired, simple machine-learning techniques could be used to define the clusters.
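As a sketch of one such technique (our own choice, not part of the CMS data): k-means with two clusters applied to the log-scaled variables used in the plots above, assuming the DRG872 table built earlier with its Pop, LandArea, and aveCharges variables.

```r
library(dplyr)

# Keep providers with usable values, then cluster on the log scales
# used in the plots above. kmeans() is in base R's stats package.
clusterData <-
  DRG872 %>%
  filter(Pop > 0, LandArea > 0, !is.na(aveCharges)) %>%
  mutate(logDensity = log10(Pop / LandArea),
         logCharge  = log10(aveCharges))
km <- kmeans(clusterData[, c("logDensity", "logCharge")], centers = 2)
clusterData$cluster <- factor(km$cluster)
```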
DRG872 <-
  DRG872 %>%
  mutate(Rgroup = ntiles(aveCharges, 4))
ggplot(DRG872 %>% filter( !stateProvider %in% c("AK","HI")), aes(x=Longitude,y=Latitude,color=Rgroup)) +
geom_point(alpha=.6) +
geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="white", fill=NA )
## Warning: Removed 6 rows containing missing values (geom_point).
Or, if you prefer, the same points can be drawn over a Google-maps background (via the ggmap package).
One way to compare institutions across DRGs is to consider the residual of each charge relative to the average for that DRG.
Charges <- MedicareProviders %>%
select(idProvider, stateProvider, zipProvider) %>%
left_join(MedicareCharges) %>%
inner_join( Zips, by = c(zipProvider = "ZIP"))
## Joining, by = "idProvider"
mod <- lm(log10(aveCharges) ~ drg, data = Charges)
Charges$resid <- resid(mod)
Charges <-
Charges %>%
mutate(rgroup = ntiles(resid, 9))
ggplot( Charges, aes(x=Pop/LandArea, y=resid)) +
geom_point(alpha=.02) +
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 25478 rows containing missing values (geom_point).
Or, looking at the average residuals (compared to the entire nation) of each provider:
Or, more locally
A difference of 0.4 on a log10 scale corresponds to a factor of 10^0.4 ≈ 2.5 in charges.
When examining the relationships among multiple variables, linear models can be a good place to start. Note that glyph-ready data is also model-ready.
mod2 <- lm(mresid ~ Income + Pop, data = ResidsWithDemographics)
rsquared(mod2)
## [1] 0.1167188
anova(mod2)
## Analysis of Variance Table
##
## Response: mresid
## Df Sum Sq Mean Sq F value Pr(>F)
## Income 1 6.611 6.6112 174.10 < 2.2e-16 ***
## Pop 1 8.091 8.0911 213.07 < 2.2e-16 ***
## Residuals 2930 111.261 0.0380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mod3 <- lm(mresid ~
pop_density + Income + college_educated +
stateProvider + Latitude,
data=ResidsWithDemographics )
rsquared(mod3)
## [1] 0.4061074
anova(mod3)
## Analysis of Variance Table
##
## Response: mresid
## Df Sum Sq Mean Sq F value Pr(>F)
## pop_density 1 4.339 4.3394 167.3250 < 2.2e-16 ***
## Income 1 6.527 6.5270 251.6751 < 2.2e-16 ***
## college_educated 1 0.291 0.2909 11.2159 0.0008214 ***
## stateProvider 43 39.823 0.9261 35.7102 < 2.2e-16 ***
## Latitude 1 0.164 0.1645 6.3428 0.0118397 *
## Residuals 2884 74.794 0.0259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1