In May 2013, the Centers for Medicare and Medicaid Services released data on the prices of Medicare fee-for-service health services at hospitals around the US in 2011. They wrote:
As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that summarize the utilization and payments for procedures and services provided to Medicare fee-for-service beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers. These data include information for the 100 most common inpatient services, 30 common outpatient services, and all physician and other supplier procedures and services performed on 11 or more Medicare beneficiaries. Providers determine what they will charge for items, services, and procedures provided to patients and these charges are the amount the providers bill for an item, service, or procedure. -source
This attracted press attention. For example, a headline in the New York Times read “Hospital Billing Varies Wildly, Government Data Shows”.
A map with the article shows the geographic disparities in hospital charges.
The news reports emphasized geographic disparities in the billing for procedures. I’m interested in whether the variation might instead reflect other variables, for instance local demographics such as population density, income, and education.
The data themselves are distributed by the Centers for Medicare and Medicaid Services as a zip-compressed CSV file.
The download page describes the formats available.
Data are available in two formats:
- Tab delimited file format (requires importing into database or statistical software; SAS® read-in language is included in the download ZIP package)
- Microsoft Excel format (.xlsx), split by provider last name (note: organizational providers with name starting with a numeric are available in the “YZ” file).
CMS has also created two summary tables: 1) aggregated information by physician or other supplier and 2) aggregated information by State and HCPCS code. A detailed methodology document can be found in the Downloads section below which contains important information regarding the limitations of data usage.
Tab Delimited Format:
Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format [Note: This Compressed ZIP package contains the tab delimited data file (Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt) which is 1.7GB uncompressed and contains more than 9 million records, thus importing this file into Microsoft Excel will result in an incomplete loading of data. Use of database or statistical software is required; a SAS® read-in statement is supplied. Additionally, this ZIP package contains the following supporting documents: CMS-AMA-CPT-2011-license-agreement.pdf, Medicare-Physician-and-Other-Supplier-PUF-SAS-Infile.sas, Medicare-Physician-and-Other-Supplier-PUF-Methodology.pdf]
Microsoft Excel Format:
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (A)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (B)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (CD)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (EFG)
… and so on.
The compressed tab-delimited file is 395 MB. Uncompressed it’s 1700 MB. This seems large.
How do I even look at these data? Could a student manage this in the time frame of, say, a course project?
I downloaded the tab-delimited file (about 45 minutes over a broadband connection), uncompressed it (using about 0.5% of my disk space), read it into R, and then saved it in R’s native format. It’s now part of the DCFdevel package.
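That workflow can be sketched as a round trip from a tab-delimited file to R’s native `.rda` format. This is only an illustration with a tiny made-up table and placeholder file names, not the actual CMS file:

```r
# Illustrative round trip; file names and toy data are placeholders,
# not the real CMS download.
tmp_txt <- tempfile(fileext = ".txt")
write.table(data.frame(Provider.Id = 1:3, Average.Covered.Charges = c(10, 20, 30)),
            tmp_txt, sep = "\t", row.names = FALSE)
MedicareSpending <- read.delim(tmp_txt)   # how a tab-delimited file is read
save(MedicareSpending,                     # R's compact native format
     file = tempfile(fileext = ".rda"))
nrow(MedicareSpending)
```

Saving with `save()` compresses the data, which is part of why the `.rda` version is so much smaller than the raw text file.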
Perhaps surprisingly, all that data, in the standard R format, is only 2.06 MB. What accounts for the disparity? Do I have the right data?
The New York Times says:
The data for 3,300 hospitals, released by the federal Centers for Medicare and Medicaid Services, shows wide variations not only regionally but among hospitals in the same area or city.
A quick summary of the data using commands that students might encounter in an introductory statistics course:
## [1] 163065
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
## [1] 3201
That corresponds well with the New York Times description.
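The outputs above could come from commands like the ones below. This is a sketch: the data frame is assumed to be named `MedicareSpending`, as in the DCFdevel package, and the toy table here merely stands in for the real 163065-row data:

```r
# Toy stand-in for the real MedicareSpending table
MedicareSpending <- data.frame(
  DRG.Definition = c("872 - SEPTICEMIA", "872 - SEPTICEMIA", "194 - PNEUMONIA"),
  Provider.Id = c(10001, 10005, 10001),
  stringsAsFactors = FALSE
)
nrow(MedicareSpending)                        # number of rows
names(MedicareSpending)                       # variable names
length(unique(MedicareSpending$Provider.Id))  # number of distinct hospitals
```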
So why are there 163065 rows if there are only 3201 hospitals? Because there is a separate listing for each of 100 procedures, known as Diagnosis Related Groups (DRGs).
## [1] 100
A typical provider is listed roughly 50 times. More precisely, each provider is listed this many times on average:
## [1] 50.94
That’s an ugly command: MedicareSpending is listed twice.
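The command was presumably something like this base-R expression, in which the data frame’s name appears twice (shown here on a toy stand-in):

```r
# Toy stand-in; the real table has 163065 rows and 3201 providers
MedicareSpending <- data.frame(Provider.Id = c(1, 1, 2))
# Average number of listings per provider: rows divided by distinct providers
nrow(MedicareSpending) / length(unique(MedicareSpending$Provider.Id))
```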
We’re going to be using a different system for data calculations, called dplyr.
## aveProc
## 1 50.94
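In dplyr the same calculation reads more cleanly; a sketch on a toy stand-in table (on the real data this gives 50.94):

```r
library(dplyr)
# Toy stand-in for the real MedicareSpending table
MedicareSpending <- data.frame(Provider.Id = c(1, 1, 2))
# Listings per provider: row count divided by the number of distinct providers
summarise(MedicareSpending, aveProc = n() / n_distinct(Provider.Id))
```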
As part of a sanity check of the data, look at how much Medicare paid out for these services. We expect that it will be several billion dollars.
## TotalPaid TotalCharged
## 1 66.69 246
$246 billion was charged. “Only” $67 billion was paid.
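A sketch of how those totals might be computed. It assumes (my assumption, not stated in the data documentation) that total dollars are estimated as discharges times the average payment or charge, reported in billions:

```r
library(dplyr)
# Toy stand-in for the real table
MedicareSpending <- data.frame(
  Total.Discharges = c(100, 50),
  Average.Total.Payments = c(8000, 10000),
  Average.Covered.Charges = c(30000, 40000)
)
summarise(MedicareSpending,
          TotalPaid    = sum(Total.Discharges * Average.Total.Payments) / 1e9,
          TotalCharged = sum(Total.Discharges * Average.Covered.Charges) / 1e9)
```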
## Using Provider.State, State as id variables
Maybe better to show the variation within states:
## Using State as id variables
Now by DRG:
## Using State, DRG as id variables
The DRGs vary in the same way within a state. Let’s pick one DRG to investigate. Perhaps the one that appears in the most states.
## Source: local data frame [11 x 4]
##
## DRG med nStates nZips
## 1 394 6601 51 1460
## 2 699 6625 49 879
## 3 491 6638 51 909
## 4 439 6711 49 918
## 5 176 6849 51 1344
## 6 287 7147 51 1623
## 7 872 7265 51 2364
## 8 065 7280 51 2141
## 9 281 7492 51 1495
## 10 640 7506 51 1644
## 11 190 7532 51 2568
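One way a table like this could be built; a sketch, since the exact ranking criterion is my assumption. The three-digit DRG code is taken from the front of DRG.Definition:

```r
library(dplyr)
# Toy stand-in for the real table
MedicareSpending <- data.frame(
  DRG.Definition = c("872 - SEPSIS", "872 - SEPSIS", "194 - PNEUMONIA"),
  Provider.State = c("MN", "CA", "MN"),
  Provider.Zip.Code = c("55105", "90210", "55105"),
  Average.Covered.Charges = c(7000, 7500, 9000),
  stringsAsFactors = FALSE
)
MedicareSpending %>%
  mutate(DRG = substr(DRG.Definition, 1, 3)) %>%
  group_by(DRG) %>%
  summarise(med = median(Average.Covered.Charges),
            nStates = n_distinct(Provider.State),
            nZips = n_distinct(Provider.Zip.Code)) %>%
  arrange(desc(nStates), med)   # widest coverage first, then by median charge
```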
Let’s take DRG 872: SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC
This service is provided by 2505 providers across 2364 ZIP codes. Almost all of those ZIP codes have only one provider:
##
## 1 2 3 4
## 2231 126 6 1
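The tally above can be produced with a nested table(): the inner call counts providers per ZIP, the outer call counts how many ZIPs have each provider count. A sketch on toy data:

```r
# Toy stand-in: provider ZIP codes for one DRG
DRG872 <- data.frame(
  Provider.Zip.Code = c("55105", "55105", "90210", "10001"),
  stringsAsFactors = FALSE
)
# Inner table: providers per ZIP; outer table: ZIPs per provider count
table(table(DRG872$Provider.Zip.Code))
```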
How much does the cost differ by state?
Let’s pick some ZIP-code-level demographic variables:
## Joining by: "ZIP"
## Joining by: "ZIP"
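The join probably looked something like this. ZipDemography is a hypothetical name for a ZIP-level demographic table; the “Joining by: ZIP” messages above appear when the `by` argument is left for dplyr to infer:

```r
library(dplyr)
# Toy stand-ins for the Medicare charges and a ZIP demographic table
Charges <- data.frame(ZIP = c("55105", "90210"), Charge = c(7000, 9000),
                      stringsAsFactors = FALSE)
ZipDemography <- data.frame(ZIP = c("55105", "90210"), Income = c(60000, 90000),
                            stringsAsFactors = FALSE)
# Match charge records to demographics by ZIP code
inner_join(Charges, ZipDemography, by = "ZIP")
```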
(Eight scatterplots followed here; warnings noted that between 1 and 161 points were dropped from each for missing values.)
Fit log(Charge) to DRG and take the residuals.
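A sketch of that step on a toy table, using lm() to remove the DRG-to-DRG differences in (log) charges:

```r
# Toy stand-in: charges for two DRGs
MedicareSpending <- data.frame(
  DRG.Definition = c("872", "872", "194", "194"),
  Average.Covered.Charges = c(7000, 7500, 9000, 9500),
  stringsAsFactors = FALSE
)
# Model log charge as a function of DRG, then keep what's left over
mod <- lm(log(Average.Covered.Charges) ~ DRG.Definition, data = MedicareSpending)
MedicareSpending$Resid <- resid(mod)
```

The residuals measure how expensive a listing is relative to other listings for the same procedure, so comparisons across ZIP codes are no longer confounded with the mix of procedures.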
## Joining by: "ZIP"
(Plots of the residuals against the ZIP-code variables followed; warnings noted that the first plot dropped 12230 points with missing values and the others dropped between 12 and 74 each.)
## [1] 0.04656
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 140 139.7 3238 <2e-16 ***
## Income 1 139 139.2 3224 <2e-16 ***
## I(College/Pop) 1 10 9.5 220 <2e-16 ***
## Residuals 136814 5905 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
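The model behind this table was presumably along these lines. Here `Combined` stands for the joined residuals-plus-demographics table; the variable names follow the ANOVA output, but the random toy data is only a stand-in:

```r
# Toy stand-in for the joined residuals + ZIP demographics table
set.seed(1)
Combined <- data.frame(
  Resid = rnorm(20), Pop = runif(20, 1e3, 1e5), LandArea = runif(20, 1, 100),
  Income = runif(20, 3e4, 9e4), College = runif(20, 100, 5000)
)
# Explain the residual charge with density, income, and education
mod1 <- lm(Resid ~ I(Pop / LandArea) + Income + I(College / Pop), data = Combined)
summary(mod1)$r.squared   # the R-squared reported above
anova(mod1)
```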
## [1] 0.4295
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 140 139.7 5409 <2e-16 ***
## Income 1 139 139.2 5386 <2e-16 ***
## I(College/Pop) 1 10 9.5 368 <2e-16 ***
## State 43 2367 55.0 2130 <2e-16 ***
## Latitude 1 5 5.1 196 <2e-16 ***
## Residuals 136770 3534 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
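The second model presumably adds State and Latitude to the same formula; again a sketch with made-up stand-in data, with names matching the ANOVA table above:

```r
# Toy stand-in for the joined table, now with State and Latitude
set.seed(1)
Combined <- data.frame(
  Resid = rnorm(40), Pop = runif(40, 1e3, 1e5), LandArea = runif(40, 1, 100),
  Income = runif(40, 3e4, 9e4), College = runif(40, 100, 5000),
  State = rep(c("MN", "CA"), 20), Latitude = runif(40, 25, 49),
  stringsAsFactors = FALSE
)
# Add State and Latitude on top of the demographic terms
mod2 <- lm(Resid ~ I(Pop / LandArea) + Income + I(College / Pop) + State + Latitude,
           data = Combined)
summary(mod2)$r.squared
anova(mod2)
```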
Can we capture the effect of State with demographic variables?