In May 2013, the Centers for Medicare and Medicaid Services released data on the prices of Medicare fee-for-service health services at hospitals around the US in 2011. They wrote:
As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that summarize the utilization and payments for procedures and services provided to Medicare fee-for-service beneficiaries by specific inpatient and outpatient hospitals, physicians, and other suppliers. These data include information for the 100 most common inpatient services, 30 common outpatient services, and all physician and other supplier procedures and services performed on 11 or more Medicare beneficiaries. Providers determine what they will charge for items, services, and procedures provided to patients and these charges are the amount the providers bill for an item, service, or procedure. -source
This attracted press attention. For example, a headline in the New York Times read “Hospital Billing Varies Wildly, Government Data Shows”.
A map with the article shows the geographic disparities in hospital charges.
The news reports emphasized geographic disparities in the billing for procedures. I’m interested in whether the variation might instead reflect other variables, for instance local demographics such as population density, income, and education.
The data themselves are distributed by the Centers for Medicare and Medicaid Services as a zip-compressed CSV file.
The download page describes the formats available.
Data are available in two formats:
- Tab delimited file format (requires importing into database or statistical software; SAS® read-in language is included in the download ZIP package)
- Microsoft Excel format (.xlsx), split by provider last name (note: organizational providers with name starting with a numeric are available in the “YZ” file).
CMS has also created two summary tables: 1) aggregated information by physician or other supplier and 2) aggregated information by State and HCPCS code. A detailed methodology document can be found in the Downloads section below which contains important information regarding the limitations of data usage.
Tab Delimited Format:
Medicare Physician and Other Supplier PUF, CY2012, Tab Delimited format [Note: This Compressed ZIP package contains the tab delimited data file (Medicare-Physician-and-Other-Supplier-PUF-CY2012.txt) which is 1.7GB uncompressed and contains more than 9 million records, thus importing this file into Microsoft Excel will result in an incomplete loading of data. Use of database or statistical software is required; a SAS® read-in statement is supplied. Additionally, this ZIP package contains the following supporting documents: CMS-AMA-CPT-2011-license-agreement.pdf, Medicare-Physician-and-Other-Supplier-PUF-SAS-Infile.sas, Medicare-Physician-and-Other-Supplier-PUF-Methodology.pdf]
Microsoft Excel Format:
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (A)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (B)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (CD)
Medicare Physician and Other Supplier PUF, CY2012, Microsoft Excel (.xlsx) Provider Last Name (EFG)
… and so on.
The compressed tab-delimited file is 395 MB. Uncompressed it’s 1700 MB. This seems large.
How do I even look at these data? Could a student manage this in the time frame of, say, a course project?
I downloaded the tab-delimited file (about 45 minutes over a broadband connection), uncompressed it (using about 0.5% of my disk space), read it into R, and then saved it in R’s native format. It’s now part of the DCFdevel package.
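That workflow can be sketched as a round trip from a tab-delimited file to R’s native `.rda` format. This is only an illustration with a tiny made-up table and placeholder file names, not the actual CMS file:

```r
# Illustrative round trip; file names and toy data are placeholders,
# not the real CMS download.
tmp_txt <- tempfile(fileext = ".txt")
write.table(data.frame(Provider.Id = 1:3, Average.Covered.Charges = c(10, 20, 30)),
            tmp_txt, sep = "\t", row.names = FALSE)
MedicareSpending <- read.delim(tmp_txt)   # how a tab-delimited file is read
save(MedicareSpending,                     # R's compact native format
     file = tempfile(fileext = ".rda"))
nrow(MedicareSpending)
```

Saving with `save()` compresses the data, which is part of why the `.rda` version is so much smaller than the raw text file.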
Perhaps surprisingly, all that data, in the standard R format, is only 2.06 MB. What accounts for the disparity? Do I have the right data?
The New York Times says:
The data for 3,300 hospitals, released by the federal Centers for Medicare and Medicaid Services, shows wide variations not only regionally but among hospitals in the same area or city.
A quick summary of the data using commands that students might encounter in an introductory statistics course:
## [1] 163065
## [1] "DRG.Definition"
## [2] "Provider.Id"
## [3] "Provider.Name"
## [4] "Provider.Street.Address"
## [5] "Provider.City"
## [6] "Provider.State"
## [7] "Provider.Zip.Code"
## [8] "Hospital.Referral.Region.Description"
## [9] "Total.Discharges"
## [10] "Average.Covered.Charges"
## [11] "Average.Total.Payments"
## [1] 3201
That corresponds well with the New York Times description.
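The outputs above could come from commands like the ones below. This is a sketch: the data frame is assumed to be named `MedicareSpending`, as in the DCFdevel package, and the toy table here merely stands in for the real 163065-row data:

```r
# Toy stand-in for the real MedicareSpending table
MedicareSpending <- data.frame(
  DRG.Definition = c("872 - SEPTICEMIA", "872 - SEPTICEMIA", "194 - PNEUMONIA"),
  Provider.Id = c(10001, 10005, 10001),
  stringsAsFactors = FALSE
)
nrow(MedicareSpending)                        # number of rows
names(MedicareSpending)                       # variable names
length(unique(MedicareSpending$Provider.Id))  # number of distinct hospitals
```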
So why are there 163065 rows if there are only 3201 hospitals? Because there is a separate listing for each of 100 procedures, known as Diagnosis Related Groups (DRGs).
## [1] 100
A typical provider is listed roughly 50 times. More precisely, each provider is listed this many times on average:
## [1] 50.94
That’s an ugly command: MedicareSpending is listed twice.
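The command was presumably something like this base-R expression, in which the data frame’s name appears twice (shown here on a toy stand-in):

```r
# Toy stand-in; the real table has 163065 rows and 3201 providers
MedicareSpending <- data.frame(Provider.Id = c(1, 1, 2))
# Average number of listings per provider: rows divided by distinct providers
nrow(MedicareSpending) / length(unique(MedicareSpending$Provider.Id))
```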
We’re going to be using a different system for data calculations, called dplyr.
## aveProc
## 1 50.94
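In dplyr the same calculation reads more cleanly; a sketch on a toy stand-in table (on the real data this gives 50.94):

```r
library(dplyr)
# Toy stand-in for the real MedicareSpending table
MedicareSpending <- data.frame(Provider.Id = c(1, 1, 2))
# Listings per provider: row count divided by the number of distinct providers
summarise(MedicareSpending, aveProc = n() / n_distinct(Provider.Id))
```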
As part of a sanity check of the data, look at how much Medicare paid out for these services. We expect that it will be several billion dollars.
## TotalPaid TotalCharged
## 1 66.69 246
$246 billion was charged. “Only” $67 billion was paid.
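A sketch of how those totals might be computed. It assumes (my assumption, not stated in the data documentation) that total dollars are estimated as discharges times the average payment or charge, reported in billions:

```r
library(dplyr)
# Toy stand-in for the real table
MedicareSpending <- data.frame(
  Total.Discharges = c(100, 50),
  Average.Total.Payments = c(8000, 10000),
  Average.Covered.Charges = c(30000, 40000)
)
summarise(MedicareSpending,
          TotalPaid    = sum(Total.Discharges * Average.Total.Payments) / 1e9,
          TotalCharged = sum(Total.Discharges * Average.Covered.Charges) / 1e9)
```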
## Using Provider.State, State as id variables
Maybe better to show the variation within states:
## Using State as id variables
Now by DRG:
## Using State, DRG as id variables
The DRGs vary in the same way within a state. Let’s pick one DRG to investigate. Perhaps the one that appears in the most states.
## Source: local data frame [11 x 4]
##
## DRG med nStates nZips
## 1 394 6601 51 1460
## 2 699 6625 49 879
## 3 491 6638 51 909
## 4 439 6711 49 918
## 5 176 6849 51 1344
## 6 287 7147 51 1623
## 7 872 7265 51 2364
## 8 065 7280 51 2141
## 9 281 7492 51 1495
## 10 640 7506 51 1644
## 11 190 7532 51 2568
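One way a table like this could be built; a sketch, since the exact ranking criterion is my assumption. The three-digit DRG code is taken from the front of DRG.Definition:

```r
library(dplyr)
# Toy stand-in for the real table
MedicareSpending <- data.frame(
  DRG.Definition = c("872 - SEPSIS", "872 - SEPSIS", "194 - PNEUMONIA"),
  Provider.State = c("MN", "CA", "MN"),
  Provider.Zip.Code = c("55105", "90210", "55105"),
  Average.Covered.Charges = c(7000, 7500, 9000),
  stringsAsFactors = FALSE
)
MedicareSpending %>%
  mutate(DRG = substr(DRG.Definition, 1, 3)) %>%
  group_by(DRG) %>%
  summarise(med = median(Average.Covered.Charges),
            nStates = n_distinct(Provider.State),
            nZips = n_distinct(Provider.Zip.Code)) %>%
  arrange(desc(nStates), med)   # widest coverage first, then by median charge
```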
Let’s take DRG 872: SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W/O MCC
This service is provided by 2505 providers across 2364 ZIP codes. Almost all of those ZIP codes have only one provider:
##
## 1 2 3 4
## 2231 126 6 1
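The tally above can be produced with a nested table(): the inner call counts providers per ZIP, the outer call counts how many ZIPs have each provider count. A sketch on toy data:

```r
# Toy stand-in: provider ZIP codes for one DRG
DRG872 <- data.frame(
  Provider.Zip.Code = c("55105", "55105", "90210", "10001"),
  stringsAsFactors = FALSE
)
# Inner table: providers per ZIP; outer table: ZIPs per provider count
table(table(DRG872$Provider.Zip.Code))
```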
How much does the cost differ by state?
Let’s pick some ZIP-code-level demographic variables:
## Joining by: "ZIP"
## Joining by: "ZIP"
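The join probably looked something like this. ZipDemography is a hypothetical name for a ZIP-level demographic table; the “Joining by: ZIP” messages above appear when the `by` argument is left for dplyr to infer:

```r
library(dplyr)
# Toy stand-ins for the Medicare charges and a ZIP demographic table
Charges <- data.frame(ZIP = c("55105", "90210"), Charge = c(7000, 9000),
                      stringsAsFactors = FALSE)
ZipDemography <- data.frame(ZIP = c("55105", "90210"), Income = c(60000, 90000),
                            stringsAsFactors = FALSE)
# Match charge records to demographics by ZIP code
inner_join(Charges, ZipDemography, by = "ZIP")
```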
(Eight scatterplots followed here; warnings noted that between 1 and 161 points were dropped from each for missing values.)
Fit log(Charge) to DRG and take the residuals.
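A sketch of that step on a toy table, using lm() to remove the DRG-to-DRG differences in (log) charges:

```r
# Toy stand-in: charges for two DRGs
MedicareSpending <- data.frame(
  DRG.Definition = c("872", "872", "194", "194"),
  Average.Covered.Charges = c(7000, 7500, 9000, 9500),
  stringsAsFactors = FALSE
)
# Model log charge as a function of DRG, then keep what's left over
mod <- lm(log(Average.Covered.Charges) ~ DRG.Definition, data = MedicareSpending)
MedicareSpending$Resid <- resid(mod)
```

The residuals measure how expensive a listing is relative to other listings for the same procedure, so comparisons across ZIP codes are no longer confounded with the mix of procedures.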
## Joining by: "ZIP"
(Plots of the residuals against the ZIP-code variables followed; warnings noted that the first plot dropped 12230 points with missing values and the others dropped between 12 and 74 each.)
## [1] 0.04656
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 140 139.7 3238 <2e-16 ***
## Income 1 139 139.2 3224 <2e-16 ***
## I(College/Pop) 1 10 9.5 220 <2e-16 ***
## Residuals 136814 5905 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
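The model behind this table was presumably along these lines. Here `Combined` stands for the joined residuals-plus-demographics table; the variable names follow the ANOVA output, but the random toy data is only a stand-in:

```r
# Toy stand-in for the joined residuals + ZIP demographics table
set.seed(1)
Combined <- data.frame(
  Resid = rnorm(20), Pop = runif(20, 1e3, 1e5), LandArea = runif(20, 1, 100),
  Income = runif(20, 3e4, 9e4), College = runif(20, 100, 5000)
)
# Explain the residual charge with density, income, and education
mod1 <- lm(Resid ~ I(Pop / LandArea) + Income + I(College / Pop), data = Combined)
summary(mod1)$r.squared   # the R-squared reported above
anova(mod1)
```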
## [1] 0.4295
## Analysis of Variance Table
##
## Response: Resid
## Df Sum Sq Mean Sq F value Pr(>F)
## I(Pop/LandArea) 1 140 139.7 5409 <2e-16 ***
## Income 1 139 139.2 5386 <2e-16 ***
## I(College/Pop) 1 10 9.5 368 <2e-16 ***
## State 43 2367 55.0 2130 <2e-16 ***
## Latitude 1 5 5.1 196 <2e-16 ***
## Residuals 136770 3534 0.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
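The second model presumably adds State and Latitude to the same formula; again a sketch with made-up stand-in data, with names matching the ANOVA table above:

```r
# Toy stand-in for the joined table, now with State and Latitude
set.seed(1)
Combined <- data.frame(
  Resid = rnorm(40), Pop = runif(40, 1e3, 1e5), LandArea = runif(40, 1, 100),
  Income = runif(40, 3e4, 9e4), College = runif(40, 100, 5000),
  State = rep(c("MN", "CA"), 20), Latitude = runif(40, 25, 49),
  stringsAsFactors = FALSE
)
# Add State and Latitude on top of the demographic terms
mod2 <- lm(Resid ~ I(Pop / LandArea) + Income + I(College / Pop) + State + Latitude,
           data = Combined)
summary(mod2)$r.squared
anova(mod2)
```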
Can we capture the effect of State with demographic variables?