The cohort design
The prevailing study design for investigating a new diagnostic test is a prospective blind comparison of the experimental test and the diagnostic reference standard (“gold standard”) in a consecutive series of patients from a representative clinical population [2–4]. In these prospective diagnostic accuracy cohort studies, patients with suspected disease undergo both an experimental diagnostic intervention and the diagnostic reference standard (Figure2). Diagnostic accuracy or the performance of the experimental diagnostic intervention is measured using a 2 × 2 table, and sensitivity, specificity, likelihood ratios, diagnostic odd ratios and accuracy can be calculated. Indeed, much literature has been published on the standards that should be met when evaluating diagnostic studies and this literature is primarily focused on diagnostic accuracy cohort studies [2–5]. The advantages of a diagnostic accuracy cohort study include their simplicity to perform, they are relatively inexpensive, and they are well accepted among the medical research community. They are appropriate for the early investigation of an experimental diagnostic test to determine if it is appropriate to continue investigating the experimental diagnostic test. However, while providing diagnostic accuracy information, diagnostic accuracy cohort studies are not directly tied to patient outcomes (Figure2). Indeed, we often forget that diagnostic tests alone do not improve patient outcomes. Only when diagnostic accuracy is coupled with effective therapy (or noxious therapy) can outcomes be influenced.
The diagnostic randomized controlled trial
Twenty-five years ago, Guyatt and colleagues proposed that “diagnostic technology should be disseminated only if they are less expensive, produce fewer untoward effects, at least as accurate as existing technologies, eliminate the need for other diagnostic interventions, without loss of accuracy or lead to the institution of effective therapy” [6]. They further stated that “establishing patient benefit often requires randomized controlled trials”. This call to action continues to be advocated in the literature but largely ignored [7, 8].
We define diagnostic randomized controlled trials as randomized comparisons of two diagnostic interventions (one standard and one experimental) with identical therapeutic interventions based on the results of the competing diagnostic interventions (for example, disease: yes or no) and with the study outcomes being clinically important consequences of diagnostic accuracy (Figures 3 and 4). While diagnostic cohort studies inform us about the relative accuracy of an experimental diagnostic intervention compared to a reference standard, they do not inform us about whether the differences in accuracy are clinically important, or the degree of clinical importance (in other words, the impact on patient outcomes).
We propose that conducting diagnostic randomized controlled trials is critical in the evaluation of diagnostic technologies and, in particular, novel technologies in the presence of standard diagnostic tests. Our reasons include the following: 1) diagnostic randomized controlled trials permit a direct comparison of the experimental test to the standard test with clinically relevant outcomes as opposed to simply comparing them to each other; 2) diagnostic randomized controlled trials can result in the experimental test being better than the reference standard (which by definition cannot occur with diagnostic cohort studies); 3) diagnostic randomized controlled trials can be conducted where there is no accepted reference standard (which cannot be done with diagnostic cohort studies); 4) test properties (sensitivity, specificity, likelihood ratios, accuracy etc.) can still be calculated in diagnostic randomized controlled trials if the reference standard is conducted and the results kept blinded in the experimental group and the results of the experimental tests are conducted but kept blinded in the reference group; and 5) diagnostic technologies could be compared across disciplines to help policy makers decide in which new diagnostic technologies to invest.
The value of the diagnostic randomized controlled trial design
In our example, the therapeutic response to the diagnostic question is whether to anticoagulate patients with confirmed DVT to prevent recurrent DVT. In the study by Kearon and colleagues [1] outcome event rates were not different, likely by virtue of missing clinically unimportant DVT in the serial ultrasound approach.
Venography is considered the reference standard for DVT. Diagnostic accuracy cohort studies would conclude that nothing could beat the “gold standard”. However, one could imagine a new test that does not diagnose clinically insignificant DVT (in other words, more specific for clinically relevant DVT) but identifies small DVT that venography might miss that will grow and ultimately become clinically significant. For Star Trek fans we will call this the “important clot tri-corder”. If the tri-corder were compared to venography in a diagnostic accuracy cohort study only, we might abandon the technology because of insensitivity for small inconsequential DVT. However, if a diagnostic randomized controlled trial were conducted, it is possible that we demonstrate its superiority to venography.
If venography were abolished by decree from government, we would not be left unable to evaluate diagnostic technologies for DVT. If we adopted diagnostic randomized controlled trials, the “important clot tri-corder” could be compared to serial ultrasounds and with clinical relevant important outcomes and then we could decide if the tri-corder should be adopted.
Venography is also complicated by the fact that it is frequently either indeterminate (inadequate contrast filling of veins) or cannot be done (unable to obtain venipuncture, contrast dye allergies, renal dysfunction). The 2 × 2 table and measures of diagnostic accuracy do not reflect this limitation of the test in actual practice nor the spectrum bias introduced by omitting these patients. The rate of indeterminate or “cannot be done tests” may be provided in manuscripts permitting the reader to contemplate the generalizability of a diagnostic cohort study to their setting but the rate of indeterminate or cannot be done tests is often omitted from publications of diagnostic cohort studies [9, 10]. Diagnostic randomized controlled trials analyzed by “intention-to-test” would incorporate these often neglected outcome effects of either indeterminate tests or tests that could not be done in both arms of a trial. Furthermore, beyond generalizability, it is plausible that patients with indeterminate tests or tests that could not be done are different than those with determinate tests, and that these differences may confound the accuracy of a diagnostic test (in other words, spectrum bias). For example, venograms often cannot be done in patients with a lot of edema due to the inability to obtain a venipuncture for the procedure. However edema is a sign of important venous thrombosis (caused by large vein obstruction; for example, thigh veins). In a diagnostic cohort study, these patients would be excluded from the accuracy analysis and this will lead to a bias against the experimental diagnostic test. The important “clot tri-corder” would have made the diagnosis but given that fewer patients with thigh vein thrombosis relative to calf vein thrombosis (where there is usually less edema) would be included in the diagnostic accuracy analysis versus venogram, the tri-corder would seemingly have lower sensitivity. A diagnostic randomized controlled trial intention-to-test analysis would eliminate this bias.
Diagnostic accuracy can still be determined in a diagnostic randomized controlled trial when the gold standard and the experimental diagnostic test are conducted but the results of one test are randomly kept blinded (Figure4). In our example, both the tri-corder and venography could be conducted by the radiologist and only the result of the randomly assigned test are disclosed to the treating physician. Patient outcomes are then evaluated on follow-up. Diagnostic accuracy (sensitivity, specificity, likelihood ratios, accuracy etc.) of the tri-corder could still be reported at the conclusion of the trial in comparison to venography.