Tuesday, November 28, 2006

Promising data on clinical performance measurement

A lot of us hold the faith that pay-for-performance will be key to getting wider adoption of health IT systems in physician offices. This faith rests on the assumption that we can get doctors to use EHRs to record clinical data in ways that are meaningful, easy to gather, and comparable across physicians and over time. An article in the most recent issue of the Archives of Internal Medicine highlights the opportunities and challenges here, and I think it will be just the beginning of a much richer and more realistic discussion of these issues.

The authors of "Assessing the Validity of National Quality Measures for Coronary Artery Disease Using an Electronic Health Record" looked at quality measurement at a large internal medicine practice using a "commercial EHR" (they don't say which one). They found that the physicians' actual performance on several measures was better than their estimated performance, which was calculated from data automatically extracted from the EHR. They conclude that:
"Profiling the quality of outpatient CAD care using data from an EHR has significant limitations. Changes in how data are routinely recorded in an EHR are needed to improve the accuracy of this type of quality measurement."

My reading of their study is that the authors raise legitimate concerns about the difficulty of using such data, but their conclusion overstates their case. They correctly point out that the real issues are not about technology, per se, but about process -- physicians don't routinely enter data in a way that makes it easy to do accurate calculations. For example, if you don't enter blood pressure readings as numeric data in the blood pressure fields of the EHR, you won't get "credit" for having taken the blood pressure.
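
To make the "credit" problem concrete, here's a minimal sketch of how an automated measure might check whether a blood pressure was recorded. The record layout and field names are hypothetical, not taken from the study's EHR; the point is simply that the calculation can only see the structured fields, not the note.

    def blood_pressure_recorded(visit: dict) -> bool:
        # Give credit only if numeric values sit in the structured vital-sign fields.
        systolic = visit.get("bp_systolic")
        diastolic = visit.get("bp_diastolic")
        return isinstance(systolic, (int, float)) and isinstance(diastolic, (int, float))

    # The physician took the blood pressure, but typed it into the note:
    visit = {"bp_systolic": None, "bp_diastolic": None,
             "note": "BP 128/76 today, continue current regimen."}
    print(blood_pressure_recorded(visit))  # False -- no credit for this visit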

More generally, they point to four sources of error (the sketch after this list shows how the last three can play out in an automated calculation):

  1. Wrong diagnosis (i.e., a patient coded as having CAD who doesn't actually have it);
  2. Data not entered as numeric or structured data (i.e., the physician may have treated a patient with aspirin, for example, but buried that fact in a free-text note rather than entering it in the medication list);
  3. Unstandardized exclusion criteria (i.e., there's no standard way to record that a patient doesn't qualify for the treatment in question, which shouldn't count against the physician); and
  4. Measures that don't account for patient non-adherence (i.e., the patient doesn't follow the physician's treatment decision -- for example, doesn't take a lipid-lowering drug even though the physician recommended and prescribed it).
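
The patients, field names, and measure in the sketch below are all hypothetical; the point is that good care documented outside the structured fields, or care that a patient refused, simply doesn't register in an automated calculation.

    def aspirin_measure(patients):
        # Percent of CAD patients with aspirin on the structured med list,
        # after dropping patients with a structured (coded) exclusion.
        eligible = [p for p in patients if not p.get("exclusion")]
        numerator = [p for p in eligible if "aspirin" in p.get("med_list", [])]
        return 100.0 * len(numerator) / len(eligible)

    patients = [
        {"med_list": ["aspirin", "atorvastatin"]},                      # structured data: counted correctly
        {"med_list": [], "note": "on ASA 81 mg daily"},                 # source 2: aspirin only in free text
        {"med_list": [], "note": "aspirin allergy"},                    # source 3: exclusion never coded
        {"med_list": [], "note": "aspirin advised, patient declined"},  # source 4: non-adherence is invisible
    ]
    print(aspirin_measure(patients))  # 25.0, though appropriate care was delivered or offered in all four cases

Only a coded exclusion would rescue cases like the third patient, which is exactly the standardization gap discussed below.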

Frankly, the only one of these problems that I find new and thus troubling is the third one, regarding exclusion criteria: the others are well known and will become less severe over time as physicians get used to documenting in structured form. Entering consistent exclusion criteria is genuinely complicated, however, because they are so measure-, condition-, and patient-specific; there are no standards out there that I'm aware of; and the EHRs I'm familiar with don't have a good way to record this information systematically anyway. So we need to figure out a way to address this issue.

That said, it's not clear how big these problems are in the scheme of things. It turns out that even with these problems, the automated measures performed pretty well -- the physicians were at 82% success on the measure they did worst on, which improved to 87% once corrections were made for the issues noted above. On the measure they did best on, their scores went from 98% to 99% after adjustment. While we'd obviously like these measures to be as accurate as possible -- particularly when quality and compensation are on the line -- this level of mis-measurement is surprisingly low given that we're really at the beginning of the beginning of clinical performance measurement.
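
To get a feel for what an adjustment of that size means in patient terms, here's a back-of-the-envelope version with made-up counts chosen only to mirror the reported shift on the worst measure (the paper doesn't present its arithmetic this way):

    # Hypothetical panel: automated extraction finds 100 eligible patients, 82 passing.
    print(100 * 82 / 100)           # 82.0
    # Suppose chart review shows 6 of the 18 apparent failures had a valid but
    # uncoded exclusion; they leave the denominator and the rate rises to ~87%
    # without any change in the care actually delivered.
    print(round(100 * 82 / 94, 1))  # 87.2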

Perhaps a bigger issue, one the study doesn't address, is what to do about so-called "ceiling effects." How are we going to tell the difference between physicians who are already high-performing? Is there really a difference between physicians performing at 98% vs 99%? And how do we tell the difference between two physicians who are both performing at 100%? This suggests that we'll need increasingly granular measures to show variation in performance, if that's what we want to show. Or do we just care that physicians get to an acceptably high level?
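
As a rough illustration of how crowded the top of the scale is, consider a hypothetical panel of 100 measured patients. The standard-error formula below is just the usual normal approximation, which is crude at proportions this extreme, but it makes the point that 98% and 99% differ by a single patient and are hard to distinguish statistically:

    from math import sqrt

    n = 100                                    # hypothetical panel size
    for p in (0.98, 0.99):
        failures = round(n * (1 - p))
        margin = 1.96 * sqrt(p * (1 - p) / n)  # rough 95% margin of error
        print(f"{p:.0%}: {failures} failing patient(s), roughly +/- {margin:.1%}")
    # 98%: 2 failing patient(s), roughly +/- 2.7%
    # 99%: 1 failing patient(s), roughly +/- 2.0%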

No one has any answers to this yet, of course, but the data from this study suggest that we may have to figure it out much sooner than I, at least, thought we would. And doing this without making physicians feel that the goalposts are always being moved could be a much bigger issue than the technical arguments about measures that dominate the conversation today.
