A comparison of approaches for adjudicating outcomes in clinical trials
Trials volume 18, Article number: 266 (2017)
Abstract
Background
Incorrect classification of outcomes in clinical trials can lead to biased estimates of treatment effect and reduced power. Ensuring appropriate adjudication methods to minimize outcome misclassification is therefore essential. While there are many reported adjudication approaches, there is little consensus over which approach is best.
Methods
Under the assumption of nondifferential assessment (i.e. that misclassification rates are the same in each treatment arm, as would typically be the case when outcome assessors are blinded), we use simulation and theoretical results to address four different questions about outcome adjudication: (a) How many assessors should be used? (b) When is it better to use onsite or central assessment? (c) Should central assessors adjudicate all outcomes, or only suspected events? (d) Should central assessment with multiple assessors be done independently or through group consensus?
Results
No one adjudication approach performs optimally in all settings. The optimal approach depends on the misclassification rates of site and central assessors, and the correlation between assessors. We found: (a) there will generally be little incremental benefit to using more than three assessors and, for outcomes with very high correlation between assessors, using one assessor is sufficient; (b) when choosing between site and central assessors, the assessor with the smallest misclassification rate should be chosen; when these rates are unknown, a combination of one site assessor and two central assessors will provide good results across a range of scenarios; (c) having central assessors adjudicate only suspected events will typically increase bias, and should be avoided, unless the threshold for sending outcomes for central assessment is extremely low; (d) central assessors can adjudicate either independently or in a group, and the preferred option should be dictated by whichever is expected to have the lowest misclassification rate.
Conclusions
Outcome adjudication is of critical importance to ensure validity of trial results, although no one approach is optimal across all settings. Investigators should choose the best strategy based on the specific characteristics of their trial. Regardless of the adjudication strategy chosen, assessors should be qualified and receive appropriate training.
Background
Many randomized controlled trials involve binary outcomes where either an event occurred or it did not (e.g. myocardial infarction, disease progression, patient response). In some situations it is self-evident whether an event occurred (e.g. mortality). However, in most situations it is less clear and may involve some subjectivity; thus, the outcome may need to be adjudicated to determine whether an event occurred. Appropriate adjudication of outcomes is of critical importance to the validity of trial results, as poor adjudication can lead to bias and a loss of power [1]. However, there is little empirical evidence to help trial organizers select the best adjudication approach in various settings.
For example, clinical trials of ulcerative colitis often use an endoscopic video to determine whether an ulcer or a degree of inflammatory change is present or absent. There are several potential adjudication approaches in this setting. The outcome could be adjudicated either by a site assessor, who is directly involved in the patient’s care, and may be influenced by knowledge of the patient’s symptoms and their clinical history, or by a central assessor, who would only have access to the endoscopic video, and is blinded to all other clinical information. The number of adjudicators could be varied. The type of outcomes sent to the central assessors could vary (e.g. all outcomes vs. only suspected events). Different methods of adjudication could be used (for example, independent adjudication vs. group adjudication). There is currently little evidence to inform these choices.
Assuming that there is a ‘true’ but unknown outcome, there are four possible results after adjudication (Table 1): true-positives and true-negatives (where the assessor correctly judges that the event did or did not occur) and false-positives and false-negatives (where the assessor incorrectly judges that the event did or did not occur).
False-positives and false-negatives are forms of misclassification [1]. Assuming that the misclassification is nondifferential (i.e. that there is the same probability of a false-negative or false-positive in each treatment arm, as would generally be the case when assessors are blinded to treatment arm), misclassification will lead to downward bias in the estimated treatment effect (i.e. the estimated treatment effect will be closer to the null than the true treatment effect), provided the true treatment effect is non-null [1]. When the treatment effect is null, misclassification will not lead to bias.
The extent of the bias under a non-null treatment effect depends on the rate of misclassification; the larger the rate of misclassification, the higher the degree of bias. For example, if the treatment effect was a difference in percentage points, the estimated treatment effect in the presence of misclassification will be biased by a factor of:
\( 1-\varphi -\theta \)   (1)
where φ is the false-positive rate (the proportion of non-events that were incorrectly classified as events) and θ is the false-negative rate (the proportion of events that were incorrectly classified as non-events) [2]. For example, assuming that the true difference in percentage points is 20%, and the false-positive and false-negative rates are both 5%, the estimated treatment effect would be 18% (i.e. it would be biased downwards by 10%). If, however, the false-positive and false-negative rates were both 15%, the estimated treatment effect would be 14% (a downward bias of 30%). Misclassification will also lead to a reduction in power, both because of the downward bias in the estimated treatment effect and because misclassification affects the variance of the estimated treatment effect [1].
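The arithmetic in this example can be reproduced with a short sketch. The function name `attenuated_effect` is an illustrative choice, not something taken from the paper; it simply multiplies the true difference by the factor 1 − φ − θ described above.

```python
def attenuated_effect(true_diff, fp_rate, fn_rate):
    """Expected estimated risk difference under nondifferential
    misclassification: the true difference times (1 - fp_rate - fn_rate)."""
    return true_diff * (1.0 - fp_rate - fn_rate)


# True difference of 20 percentage points, as in the text.
for rate in (0.05, 0.15):
    est = attenuated_effect(0.20, rate, rate)
    relative_bias = 100.0 * (est - 0.20) / 0.20
    print(f"error rates {rate:.2f}: estimate {est:.3f}, bias {relative_bias:.0f}%")
```

With both error rates at 5% the estimate is about 0.18, and with both at 15% it is about 0.14, matching the figures quoted in the paragraph.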
Given the adverse consequences of misclassification of outcomes, reducing the misclassification rate is of critical importance to ensure the validity of results in randomized controlled trials. A number of case studies have compared results from different adjudication approaches in specific trials [3,4,5,6,7,8,9,10,11,12,13,14,15]; however, little attention has been paid to the statistical properties of the different approaches. In this paper, we address four main questions related to outcome adjudication: (a) How many assessors should we use? (b) Should we use site or central assessors to adjudicate outcomes? (c) Should central assessors adjudicate all outcomes, or only suspected events? (d) If we use central assessment with multiple assessors, should adjudication be done by each assessor independently, or as a group consensus (e.g. in an endpoint review committee)?
Methods
In this paper, we focus on the assessment approach for binary outcomes, and assume that there is nondifferential assessment (i.e. the false-positive and false-negative rates are the same in all treatment arms), as would typically be the case in a double-blind trial or in trials where all assessors are blinded to treatment allocation.
We begin by describing a relevant example from clinical trials in ulcerative colitis. We then discuss the notation to be used, followed by an overview of the statistical properties of several different adjudication approaches. We then describe a simulation study comparing the different adjudication approaches. Key terminology used in this paper is listed in Table 2.
Clinical trials in ulcerative colitis
Ulcerative colitis is a chronic, relapsing disorder of the colon resulting from an excessive immune response against environmental antigens, although the aetiology is unknown. Patients typically present with symptoms of abdominal pain, bleeding and diarrhoea. Most clinical trials for ulcerative colitis require adjudication of endoscopy results (an examination of the patient’s colon using a fibreoptic camera) to determine whether there is evidence of disease activity (e.g. ulceration, inflammation, bleeding). The results of the endoscopy are used to inform the outcome of clinical remission (yes vs. no), either alone or in conjunction with other information, such as patient-reported symptoms and a global physician assessment [16].
The results from the endoscopy could be adjudicated based on a number of different approaches. For example, the results could be adjudicated by the local onsite clinician who performs the endoscopy (site assessor), or by a clinician not based at the site who adjudicates the outcome based on a review of the endoscopic video (central assessor).
Furthermore, a central assessor, if used, could adjudicate either all patients or only a subset of patients. For example, one adjudication strategy could be to have a site assessor perform an initial adjudication, and then have a central assessor adjudicate suspected events only. Additionally, under central adjudication, any number of assessors could be used. For example, instead of using only one assessor, three assessors could be used. This would typically involve two assessors performing an initial adjudication, and if they disagree, a third assessor acting as a tiebreak. This approach could be generalized to any odd number of assessors, e.g. 3, 5, 7, 9, etc. An additional challenge in using central adjudication is whether to only include the assessments made by the central assessor(s), or to use a hybrid approach, which also includes the assessment made by the site assessor.
Finally, under central adjudication with multiple assessors, the adjudication could be done either independently by each assessor or all together as a group (if the practicalities of the trial allow for this). Under the independent assessment approach, each assessor performs the adjudication independently, without knowledge of results from the other assessors; the final classification is then based on a ‘majority rules’ approach. For example, with three assessors, the outcome is classified as an event if the majority of assessors (two or more) adjudicate an event (and vice versa for no event). Conversely, the group approach would typically involve an endpoint review committee, in which all assessors meet and conduct the adjudication together.
In the following sections, we outline some of the statistical considerations of these differing adjudication approaches. We then discuss some considerations for choosing an adjudication approach in ulcerative colitis clinical trials later in the paper.
Notation
Before considering any of the listed adjudication scenarios in detail, we outline some notation to be used. Let Y _{ i } denote the true outcome for the ith patient, with Y _{ i } = 1 indicating that an event occurred and Y _{ i } = 0 indicating that no event occurred. Assume that there are j different assessors; then, let Y _{ ij } ^{*} denote the adjudicated outcome for the ith patient from the jth assessor: Y _{ ij } ^{*} = 1 means that the assessor has adjudicated an event, and Y _{ ij } ^{*} = 0 means that the assessor has adjudicated no event. Finally, let Y _{ i } ^{*} denote the final adjudicated outcome for the ith patient (which is calculated based on the individual Y _{ ij } ^{*}s).
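Then
\( \varphi =P\left({Y}_{ij}^{*}=1\mid {Y}_i=0\right) \)
denotes the false-positive rate (the probability of a non-event being incorrectly classified as an event), and
\( \theta =P\left({Y}_{ij}^{*}=0\mid {Y}_i=1\right) \)
denotes the false-negative rate (the probability of an event being incorrectly classified as a non-event).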
Let
\( \beta ={p}_1-{p}_0 \)
where β is the true treatment effect, and p _{1} and p _{0} are the true probabilities of an event in the intervention and control arm respectively. Therefore, β represents the difference in proportions between treatment arms. The estimated treatment effect, \( \widehat{\beta} \), is
\( \widehat{\beta}={\widehat{p}}_1-{\widehat{p}}_0 \)
where \( {\widehat{p}}_1 \) and \( {\widehat{p}}_0 \) are the estimated event rates in the intervention and control arms, respectively.
In the presence of misclassification,
\( E\left(\widehat{\beta}\right)=\beta \left(1-\varphi -\theta \right) \)
indicating that the estimated treatment effect will always be closer to zero (indicating no difference between treatment groups) than the true difference, and that if β = 0 (indicating no difference between treatment groups) then \( \widehat{\beta}=\beta \), and our estimate will be unbiased.
Finally, we define the intraclass correlation (ICC) as
\( \mathrm{ICC}=\frac{P\left({Y}_{ij}^{*}={Y}_{ij'}^{*}\right)-P\left({Y}_{ij}^{*}={Y}_{i'j'}^{*}\right)}{1-P\left({Y}_{ij}^{*}={Y}_{i'j'}^{*}\right)} \)
where P(Y _{ ij } ^{*} = Y _{ ij′} ^{*}) denotes the probability that two different assessors make the same classification for the same patient, and P(Y _{ ij } ^{*} = Y _{ i′j′ } ^{*}) denotes the probability that two different assessors make the same classification for different patients [17]. The ICC measures how similar classifications from different assessors are; a value of zero indicates that assessors are no more likely than chance to make the same assessment, whereas a value of one indicates that assessors always make the same assessment. This measure is equivalent to the kappa index [17, 18], which is used to quantify the interrater agreement between different assessors.
Overview of different scenarios
How many assessors should we use?
We begin with the simplest question: if we were using only central assessors, how many assessors should we use? For the moment, we assume that adjudication will be done independently by each assessor, and that the final classification will be based on a ‘majority rules’ approach. This approach requires an odd number of assessors (though it could initially be undertaken by fewer assessors, with the final assessor only called on, if required, to break a tie).
Adjudication requires time, effort, and other resources [9, 14, 19]; each additional assessor will come at a cost. In some trials, this could be substantial [14]. Therefore, the number of assessors should only be increased when the benefit outweighs the costs.
The potential advantage of increasing the number of assessors can be seen in the following example. Assume that all assessors have a false-positive rate of 10% (i.e. 10% of the time they will falsely classify a non-event as an event). Furthermore, let us assume that results from each assessor are completely independent (i.e. that ICC = 0, and that assessors are no more likely than chance to make the same classification). Then, using only one assessor will lead to an overall false-positive rate of 10%. However, with three assessors, a false-positive will only occur if two or more assessors get the classification wrong. Therefore, with three assessors, the overall false-positive rate can be calculated as:
\( 3\times {0.1}^2\times 0.9+{0.1}^3=0.027+0.001=0.028 \)
i.e. using three assessors can reduce the overall false-positive rate from 10% to only 2.8%, which could substantially reduce bias and increase power.
However, if we make the alternate assumption that ICC = 1 (i.e. that assessors will always make the same classification), then all three assessors would essentially function as one. In this situation, increasing the number of assessors from one to three has no impact, and the overall false-positive rate will be unchanged at 10%. Thus, increasing the number of assessors can reduce misclassification, thereby reducing bias and increasing power, but the extent of this benefit will depend on the correlation between different assessors; the higher the correlation, the less incremental benefit of adding extra assessors.
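The independent-assessor calculation generalizes to any odd number of assessors through the binomial distribution. The following is a minimal sketch under the ICC = 0 assumption used in the example; the function name `majority_error_rate` is hypothetical and chosen only for illustration.

```python
from math import comb


def majority_error_rate(error_rate, n_assessors):
    """Probability that a majority of independent assessors (ICC = 0)
    misclassify the same outcome. Requires an odd number of assessors."""
    if n_assessors % 2 == 0:
        raise ValueError("majority rule needs an odd number of assessors")
    needed = n_assessors // 2 + 1  # smallest number of errors forming a majority
    return sum(
        comb(n_assessors, k) * error_rate**k * (1 - error_rate) ** (n_assessors - k)
        for k in range(needed, n_assessors + 1)
    )


for m in (1, 3, 5, 7):
    print(m, round(majority_error_rate(0.10, m), 4))
```

With a 10% individual false-positive rate this gives 0.1 for one assessor and 0.028 for three, as in the text. The further reductions for five and seven assessors are small in absolute terms, and they shrink towards nothing as the correlation between assessors approaches one, which this independence-based sketch does not model.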
Should we use site or central assessors?
Assuming that adjudication could be performed by either site assessors or central assessors (and that both parties are blind to treatment allocation), a natural question is which type of assessor to use. Central assessment typically requires additional resources compared with site assessment. For example, information needs to be collated and sent to the central assessors, who then need to take time out to perform the adjudication, often in a time-sensitive manner. Furthermore, specialized equipment may sometimes be required to record the event. Conversely, site assessment can be done on the spot, and is typically (though not always) done routinely. Therefore, using central assessment usually carries a resource burden, and should only be used when the benefit of doing so outweighs the costs.
In practice, central assessment will benefit the trial when central assessors have lower misclassification rates than site assessors. Therefore, when central assessors are expected to be better at classifying outcomes, central assessment should be the preferred option. Conversely, when site assessors are expected to have better classification rates, they should be the preferred option. When the classification rates are expected to be approximately the same between site and central assessors, then either option could be used.
In practice, however, the comparative misclassification rates of site and central assessors are often unknown and may be difficult to predict. In this situation, it is not clear which adjudication approach to use. We could then guess at which type of assessor is likely to be better, based on the specific trial characteristics. For example, for outcomes that are easier to adjudicate and do not require much experience or specific training (e.g. whether a patient is able to walk 10 m across a room unaided), site and central assessors are likely to be comparable, and so using a site assessor to reduce resource burden makes sense. Conversely, for more challenging outcomes, which require experience and specific training (e.g. occurrence of myocardial infarction), using central assessors may be the best approach, as it will be easier to ensure that a small number of central assessors are sufficiently experienced and have received appropriate and consistent training than to ensure the same for a large number of site assessors.
An alternative option, when the comparative misclassification rates of site and central assessors are unknown, is to use a combination approach to adjudication; instead of using either site or central assessors, both could be used. For example, the independent adjudication of both the site and a central assessor could be used, with a second central assessor (a third overall assessor) used if a tiebreak is required.
Under the assumption of independence between assessors (i.e. ICC = 0), we can derive the overall false-positive rate for the approach of using one site assessor and two central assessors as follows. Let φ _{SA} and φ _{CA} denote the false-positive rates of the site and central assessors respectively, and θ _{SA} and θ _{CA} denote their false-negative rates. Then, the false-positive rate for the approach of using one site assessor and two central assessors (denoted φ _{1SA + 2CA}) is:
\( {\varphi}_{1\mathrm{SA}+2\mathrm{CA}}={\varphi}_{\mathrm{CA}}^2+2{\varphi}_{\mathrm{SA}}{\varphi}_{\mathrm{CA}}\left(1-{\varphi}_{\mathrm{CA}}\right) \)
The false-negative rate is derived similarly. We can see in Fig. 1 that this approach generally compares favourably with other potential approaches (i.e. either one site or one central assessor, or three central assessors) across a range of scenarios.
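As a rough numerical illustration, the sketch below computes the false-positive rate of the combined approach under independence by enumerating the two ways a majority can be wrong: both central assessors err, or exactly one central assessor errs together with the site assessor. The function names are hypothetical, and the input rates are example values rather than results from the paper.

```python
def fp_one_site_two_central(fp_site, fp_central):
    """False-positive rate of a majority vote over one site assessor and
    two central assessors, assuming independent errors (ICC = 0)."""
    both_central_wrong = fp_central**2
    one_central_and_site_wrong = 2 * fp_central * (1 - fp_central) * fp_site
    return both_central_wrong + one_central_and_site_wrong


def fp_three_central(fp_central):
    """False-positive rate of a majority vote over three central assessors."""
    return 3 * fp_central**2 * (1 - fp_central) + fp_central**3


# Example rates: site assessor 20%, central assessor 10%, and the reverse.
for fp_site, fp_central in ((0.20, 0.10), (0.10, 0.20)):
    print(
        f"site {fp_site:.2f}, central {fp_central:.2f}: "
        f"1SA+2CA {fp_one_site_two_central(fp_site, fp_central):.3f}, "
        f"3CA {fp_three_central(fp_central):.3f}"
    )
```

When the site and central rates are equal, the combined rate coincides with the three-assessor rate, which is a quick consistency check on the expression.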
Should central assessors adjudicate all outcomes, or only suspected events?
A common adjudication strategy is for site assessors to perform an initial adjudication of all outcomes, and then for central assessors to readjudicate suspected events. This strategy is outlined in Fig. 2; if the site assessor adjudicates no event, the outcome is classified as no event. If the site assessor adjudicates an event, the outcome is passed on to the central assessor(s), who will readjudicate it, with their decision being final. Under this approach, the outcome will be classified as an event if both the site and central assessors adjudicate an event, and it will be classified as no event if either the site or central assessors adjudicate no event. We refer to this adjudication approach as a two-stage approach in this paper.
The two-stage approach is likely to be used in situations where it is believed that central assessors have better classification rates than site assessors, but central assessors come with a cost or resource burden, and so it is more economical to only have central assessors adjudicate a subset of outcomes in order to reduce costs. Therefore, a relevant question is: Under which circumstances is it acceptable to have central assessors only assess suspected events, rather than all outcomes? In the remainder of this section, we compare the two approaches (the two-stage approach vs. having central assessors adjudicate all outcomes).
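Under the assumption of independence between site and central assessors (i.e. ICC = 0), we can derive the overall false-positive and false-negative rates under the two-stage approach as follows. Let φ _{SA} and φ _{CA} denote the false-positive rates of the site and central assessors respectively, and θ _{SA} and θ _{CA} denote their false-negative rates. Then, the false-positive and false-negative rates for the two-stage approach (denoted φ _{2stage} and θ _{2stage}, respectively) can be calculated as
\( {\varphi}_{2\mathrm{stage}}={\varphi}_{\mathrm{SA}}{\varphi}_{\mathrm{CA}} \)
and
\( {\theta}_{2\mathrm{stage}}={\theta}_{\mathrm{SA}}+\left(1-{\theta}_{\mathrm{SA}}\right){\theta}_{\mathrm{CA}} \)
Because φ _{SA}, φ _{CA}, θ _{SA} and θ _{CA} are all between 0 and 1, this indicates that
\( {\varphi}_{2\mathrm{stage}}\le \min \left({\varphi}_{\mathrm{SA}},{\varphi}_{\mathrm{CA}}\right) \)
and
\( {\theta}_{2\mathrm{stage}}\ge \max \left({\theta}_{\mathrm{SA}},{\theta}_{\mathrm{CA}}\right) \)
That is, the two-stage approach will always have a false-positive rate that is less than or equal to that of a site or central assessor alone; however, it will always have a false-negative rate that is equal or higher. Therefore, the two-stage approach will only be as good as or better than using a central assessor to adjudicate all outcomes when the reduction in the false-positive rate is equal to or greater than the increase in the false-negative rate, i.e. when
\( {\varphi}_{\mathrm{CA}}-{\varphi}_{2\mathrm{stage}}\ge {\theta}_{2\mathrm{stage}}-{\theta}_{\mathrm{CA}} \)
This expression reduces to
\( {\theta}_{\mathrm{SA}}\le \frac{\varphi_{\mathrm{CA}}\left(1-{\varphi}_{\mathrm{SA}}\right)}{1-{\theta}_{\mathrm{CA}}} \)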
This expression relies on several items; however, we can see that a primary factor that will influence whether the two-stage approach is useful is the false-negative rate of the site assessors; when this is very small, using the two-stage approach is likely to be beneficial. However, the assumption that site assessors will have very low false-negative rates is contradictory to the assumption behind the two-stage approach, which is that the site assessor’s misclassification rates are high enough that suspected events require further adjudication by central assessors to correct any mistakes the site assessor has made.
This highlights the contradiction in logic behind the two-stage approach: that site assessors cannot be trusted to determine whether an event occurred, but can be trusted to determine whether an event did not occur. This approach assumes that correctly classifying events is more important than correctly classifying non-events, which is not the case; as seen in formula (1), misclassification of both events and non-events has the same effect on bias.
The bias of the two-stage approach (relative to having a central assessor adjudicate all events) is shown in Fig. 3 for different values of the site assessor’s false-negative rate. This graph reflects the situation where the central assessor has lower rates of misclassification than the site assessor (as would typically be assumed when this approach is used). From the graph, we can see that the two-stage approach is only as good as having a central assessor adjudicate all outcomes when the site assessor’s false-negative rate is close to zero.
From Fig. 3, we can see that, rather than have site assessors pass on outcomes they think are events to central assessors, a better approach would be to have them pass on all outcomes apart from those they can rule out as definitive non-events (i.e. they would still pass on outcomes they think are non-events, provided there is any possibility that it could be an event). By lowering the threshold at which outcomes are sent to the central assessors, the site assessor’s false-negative rate is minimized, allowing central assessors to adjudicate relevant outcomes (i.e. all outcomes except those which are definitely not events) while still reducing the overall burden compared with having the central assessors adjudicate all outcomes. However, this approach does rely on site assessors being able to definitively identify outcomes that are not events; if this is unlikely to be the case, this modified approach is unlikely to be useful.
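The trade-off discussed in this section can be explored numerically. The sketch below assumes independent site and central assessors (ICC = 0): a two-stage false-positive requires both assessors to err, while a two-stage false-negative occurs if the site assessor misses the event or passes it on and the central assessor then misses it. The function names and the example rates are illustrative assumptions.

```python
def two_stage_rates(fp_site, fn_site, fp_central, fn_central):
    """Overall (false-positive, false-negative) rates of the two-stage
    approach, assuming independent site and central assessors."""
    fp = fp_site * fp_central                    # both must wrongly call an event
    fn = fn_site + (1 - fn_site) * fn_central    # missed at site, or at central review
    return fp, fn


def attenuation(fp, fn):
    """Factor by which the risk difference is shrunk towards the null."""
    return 1 - fp - fn


# Example: central assessor 5%/5%, site false-positive rate 10%.
central_only = attenuation(0.05, 0.05)
for fn_site in (0.0, 0.05, 0.10, 0.20):
    fp, fn = two_stage_rates(0.10, fn_site, 0.05, 0.05)
    print(
        f"site FN {fn_site:.2f}: two-stage factor {attenuation(fp, fn):.4f} "
        f"vs central-only {central_only:.4f}"
    )
```

In this example the two-stage factor only exceeds the central-only factor when the site assessor's false-negative rate is very close to zero, which is the pattern the text describes.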
Independent vs. group adjudication for central assessors
If multiple central assessors are used for adjudication, they could either conduct the adjudication independently (i.e. each could adjudicate alone, with no knowledge of results from other assessors), or they could perform the adjudication in a group if trial practicalities permit (e.g. as part of an endpoint review committee), where each patient is discussed, and the final classification is decided together. Under group adjudication, the outcome could either be decided by a vote (with a ‘majority rules’ approach), by consensus, with disagreements resolved by discussion (although in practice, achieving consensus between assessors for all patients may be impossible), or through another method (e.g. a Delphi process [20, 21]).
Group adjudication may require additional resources compared with independent adjudication, for example, additional travel, time, or communication costs. Therefore, it is useful to know when we might expect an improvement in classification under the group approach, in order to determine whether the additional resource and logistical requirements are worthwhile.
Depending on the dynamics of the group, we might reasonably expect group adjudication to either increase or decrease the classification rates compared with independent adjudication and, potentially, to increase the correlation between assessors. For example, there may be one dominant voice or an opinion leader in the group that the other assessors follow; this would have the effect of increasing correlation between individuals (as the group would effectively act as one voice). Furthermore, if the dominant voice belongs to an assessor with poor classification rates, this could increase the overall misclassification rate compared with independent adjudication. Conversely, if the dynamic is one of collaboration and engagement, a discussion of each patient may allow the group to arrive at the correct classification more often than if the assessors had done the adjudication alone.
In practice, group adjudication may be preferred if it is expected that this would lead to better classification rates than achieved for independent adjudication. However, this must be balanced against the potential increase in correlation between assessors from group adjudication; as seen earlier, the higher the correlation, the less benefit there is to using more than one assessor. Therefore, if group adjudication increases the classification rate by a small margin, but leads to a substantial increase in the correlation, it may be that using independent adjudication actually leads to slightly better classification rates overall.
Simulation study
We conducted a simulation study to compare the different adjudication approaches across these scenarios. As defined earlier, Y _{ i } represents the true outcome for the ith patient, with Y _{ i } = 1 indicating that an event occurred, and Y _{ i } = 0 indicating that no event occurred. Y _{ ij } ^{*} represents the adjudicated outcome for the ith patient from the jth assessor. Y _{ ij } ^{*} = 1 indicates that the assessor has adjudicated an event, and Y _{ ij } ^{*} = 0 indicates that the assessor has adjudicated no event. Finally, Y _{ i } ^{*} represents the final adjudicated outcome for the ith patient (which is calculated based on the individual values of Y _{ ij } ^{*}).
We generated the true patient outcomes from the following model:
\( P\left({Y}_i=1\right)=\alpha +\beta {X}_i \)
where α is the probability of an event in the control arm, X _{ i } represents the treatment group to which the ith patient was allocated (X _{ i } = 1 represents the intervention group, and X _{ i } = 0 represents the control group), and, as before, β is the difference in proportions between the intervention and control arm (representing the treatment effect).
For all simulation scenarios we set n = 1000 (500 patients in each of the two treatment arms), and α = 0.5 (denoting a 50% event rate in the control). We set β = −0.088 (indicating an event rate of 41.2% in the intervention arm, representing a difference in percentage points of −8.8%). We chose these values of α and β to give 80% power under perfect adjudication.
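The stated 80% power can be checked approximately with a normal-approximation calculation for a difference in two proportions. This is only a sketch: the paper does not say how the power was derived, and the function names and the two-sided 5% critical value are assumptions made here.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def approx_power(p_control, risk_diff, n_per_arm, z_crit=1.959964):
    """Approximate power of a two-sided Wald test for a risk difference."""
    p_treat = p_control + risk_diff
    se = sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treat * (1 - p_treat) / n_per_arm
    )
    z = abs(risk_diff) / se
    return normal_cdf(z - z_crit) + normal_cdf(-z - z_crit)


power = approx_power(p_control=0.5, risk_diff=-0.088, n_per_arm=500)
print(f"approximate power: {100 * power:.1f}%")
```

Under these assumptions the result is close to 80%, consistent with the values of α and β chosen for the simulation.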
We generated the adjudicated outcomes from central assessors using a beta-binomial distribution with a specified ICC (representing the degree of correlation or agreement between different central assessors), and with a specified false-positive and false-negative rate. For scenarios that included a site assessor, we generated adjudicated outcomes from the site assessor based on the same beta-binomial model as for the central assessor, but using the specified false-positive and false-negative rates for the site assessor. This means that the correlation between site and central assessors will be less than the correlation between different central assessors, owing to the differences in the false-positive and false-negative rates.
Details of how we calculated an overall adjudicated outcome (Y _{ i } ^{*}) for each scenario are provided next. Once we calculated Y _{ i } ^{*} for each patient, we used these adjudicated outcomes to estimate \( \widehat{\beta} \). This was done using a generalised linear model with a binomial family and an identity link; we used Y _{ i } ^{*} as the outcome, and the treatment arm (X _{ i }) as the only covariate.
Then, for each different adjudication approach, we estimated the following:

Percentage bias in \( \widehat{\beta} \)

Power
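The percentage bias in \( \widehat{\beta} \) was calculated as
\( 100\times \frac{\overline{\widehat{\beta}}-\beta }{\beta } \)
where \( \overline{\widehat{\beta}} \) denotes the mean of the estimated treatment effects across replications.
Power was defined as the percentage of replications for which p < 0.05.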
For all simulation scenarios, we used 10,000 replications. We provide further details on each specific simulation scenario next.
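The overall simulation loop can be sketched in a simplified form. This sketch departs from the paper in two labelled ways: assessors are assumed to err independently (ICC = 0) rather than following a beta-binomial model, and the treatment effect is estimated as the difference in observed proportions with a Wald test, which gives the same point estimate as an identity-link binomial model with a single binary covariate. The function names, default replication count and seed are illustrative.

```python
import random
from math import erf, sqrt


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))


def simulate(fp_rate, fn_rate, n_assessors=1, n_per_arm=500,
             alpha=0.5, beta=-0.088, n_reps=200, seed=2017):
    """Return (percentage bias, power in %) under majority-rule adjudication.

    Assumption: assessors misclassify independently (ICC = 0).
    """
    rng = random.Random(seed)
    estimates = []
    n_significant = 0
    for _rep in range(n_reps):
        p_hat = []
        for arm in (0, 1):
            p_true = alpha + beta * arm           # true event probability
            n_events = 0
            for _patient in range(n_per_arm):
                is_event = rng.random() < p_true  # true outcome
                error_rate = fn_rate if is_event else fp_rate
                votes = 0
                for _assessor in range(n_assessors):
                    wrong = rng.random() < error_rate
                    votes += int(is_event != wrong)   # 1 if 'event' adjudicated
                n_events += int(2 * votes > n_assessors)  # majority rule
            p_hat.append(n_events / n_per_arm)
        estimate = p_hat[1] - p_hat[0]
        se = sqrt(sum(p * (1.0 - p) / n_per_arm for p in p_hat))
        if se > 0 and two_sided_p(estimate / se) < 0.05:
            n_significant += 1
        estimates.append(estimate)
    mean_estimate = sum(estimates) / n_reps
    pct_bias = 100.0 * (mean_estimate - beta) / beta
    return pct_bias, 100.0 * n_significant / n_reps


bias, power = simulate(fp_rate=0.10, fn_rate=0.10, n_assessors=1)
print(f"one assessor: bias {bias:.1f}%, power {power:.1f}%")
```

With 10% false-positive and false-negative rates and a single assessor, the percentage bias should come out at roughly −20%, in line with the factor 1 − φ − θ. A far larger `n_reps` (the paper uses 10,000) would be needed for stable estimates.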
How many assessors should we use?
We evaluated using either one, three, five, or seven central assessors. With multiple assessors, the final classification of the outcome was based on the adjudication of the majority of assessors (e.g. with three assessors, the final classification was ‘event’ if two or more assessors adjudicated ‘event’, and vice versa). With only one assessor, the final classification of the outcome was set to the one assessor’s adjudicated outcome. Formally:
\( {Y}_i^{*}=\left\{\begin{array}{ll}1& \mathrm{if}\ \sum_{j=1}^m{Y}_{ij}^{*}>m/2\\ {}0& \mathrm{otherwise}\end{array}\right. \)
where m represents the total number of assessors used.
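The majority rule described here is simple to express in code. The sketch below uses a hypothetical function name and takes each assessor's adjudication as 1 (event) or 0 (no event).

```python
def final_classification(assessments):
    """Majority-rule final outcome from an odd number of 0/1 adjudications."""
    m = len(assessments)
    if m % 2 == 0:
        raise ValueError("majority rule needs an odd number of assessors")
    return 1 if sum(assessments) > m / 2 else 0


print(final_classification([1, 0, 1]))  # two of three assessors adjudicate an event
print(final_classification([0]))        # a single assessor decides alone
```

With one assessor the rule reduces to that assessor's own adjudication, matching the description in the text.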
We varied the correlation between assessors (ICC) and the assessors’ false-positive and false-negative rates (φ and θ). We set the false-positive and false-negative rates to be equal for each assessor, e.g. if φ was 5% then θ was also set to 5%. We varied the ICC between 0.10, 0.25, 0.50, 0.75 and 0.90, and varied φ and θ between 5%, 10%, 15% and 20%. This led to 5 × 4 = 20 total scenarios.
Should we use site or central assessors?
We evaluated four different approaches of adjudication:

One site assessor

One central assessor

Three central assessors

One site assessor and two central assessors
For the approach with three central assessors, the final classification of the outcome was calculated as before (i.e. based on the majority of adjudications). For the approaches with one site assessor or one central assessor, the final classification was based on the assessor’s adjudicated outcome. For the approach with one site assessor and two central assessors, the final classification of the outcome was based on the adjudication of the majority of assessors (the same method used for three central assessors), regardless of whether they were site or central assessors. For example, the final classification was an ‘event’ if two or more assessors adjudicated ‘event’, regardless of whether these were the two central assessors, or one central assessor and one site assessor.
We varied the ICC and the false-positive and false-negative rates for the different assessors. We varied the ICC between 0.10, 0.25, 0.50, 0.75 and 0.90. We used three different scenarios for the false-positive and false-negative rates of the assessors:

Scenario 1: central assessors have lower false-positive and false-negative rates:

○ Site assessor: φ _{SA} = 20%, θ _{SA} = 20%

○ Central assessor: φ _{CA} = 10%, θ _{CA} = 10%


Scenario 2: site assessors have lower false-positive and false-negative rates:

○ Site assessor: φ _{SA} = 10%, θ _{SA} = 10%

○ Central assessor: φ _{CA} = 20%, θ _{CA} = 20%


Scenario 3: site assessors have lower false-negative rates, central assessors have lower false-positive rates:

○ Site assessor: φ _{SA} = 20%, θ _{SA} = 10%

○ Central assessor: φ _{CA} = 10%, θ _{CA} = 20%

This led to 5 × 3 = 15 total scenarios.
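As a rough point of reference for these scenarios, the sketch below computes the false-positive and false-negative rates of the final classification for each of the four approaches, under the simplifying assumption that assessors err independently. This is the zero-correlation limit, which is not one of the simulated ICC values, so its rankings need not match the simulated results. The names `majority_of_three`, `approach_rates` and `RATE_SCENARIOS` are hypothetical.

```python
def majority_of_three(e1: float, e2: float, e3: float) -> float:
    """Probability that at least two of three INDEPENDENT assessors err,
    given their individual error probabilities."""
    return (e1 * e2 * (1 - e3) + e1 * e3 * (1 - e2)
            + e2 * e3 * (1 - e1) + e1 * e2 * e3)


# (phi, theta) = (false-positive rate, false-negative rate) per assessor type
RATE_SCENARIOS = {
    1: {"site": (0.20, 0.20), "central": (0.10, 0.10)},
    2: {"site": (0.10, 0.10), "central": (0.20, 0.20)},
    3: {"site": (0.20, 0.10), "central": (0.10, 0.20)},
}


def approach_rates(site, central):
    """Return {approach: (false_positive, false_negative)} for the four
    adjudication approaches, assuming independent errors."""
    def combine(rule):
        return tuple(rule(s, c) for s, c in zip(site, central))

    return {
        "one site": combine(lambda s, c: s),
        "one central": combine(lambda s, c: c),
        "three central": combine(lambda s, c: majority_of_three(c, c, c)),
        "one site + two central": combine(lambda s, c: majority_of_three(s, c, c)),
    }


for number, rates in RATE_SCENARIOS.items():
    for approach, (fp, fn) in approach_rates(rates["site"], rates["central"]).items():
        print(f"Scenario {number}, {approach}: FP={fp:.3f}, FN={fn:.3f}")
```

In Scenario 3, for example, the combined panel has independent-error rates of 0.046 (false positive) and 0.072 (false negative), compared with 0.028 and 0.104 for three central assessors, which illustrates how a mixed panel balances the two error types.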
Should central assessors adjudicate all outcomes, or only suspected events?
We evaluated three different approaches to adjudication:

One site assessor

One central assessor

Two-stage approach
We simulated a scenario where the central assessor generally had better classification rates than the site assessor (as would generally be the assumption if a two-stage adjudication approach were adopted).
For the central assessor, we set both the false-positive and false-negative rates to 5% (i.e. φ _{CA} = θ _{CA} = 5%); for the site assessor, we set the false-positive rate to 10% (φ _{SA} = 10%). We then varied the false-negative rate for the site assessor (θ _{SA}) between 5%, 10%, 15% and 20%. We varied the ICC between 0.10, 0.50 and 0.90 (note that this ICC is conditional on having the same misclassification rate, and so is not exact between site and central assessors).
This led to 4 × 3 = 12 total scenarios.
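The reason the site assessor's false-negative rate matters so much for the two-stage approach can be seen from a short calculation. The sketch below assumes that (i) the site assessor adjudicates every outcome, (ii) only outcomes the site assessor classifies as ‘event’ are sent to the central assessor, whose decision is then final, (iii) outcomes the site assessor classifies as ‘no event’ are final, and (iv) site and central errors are independent. Assumption (iv) in particular is a simplification, since the scenarios here use non-zero ICCs. The function name `two_stage_rates` is hypothetical.

```python
def two_stage_rates(phi_sa: float, theta_sa: float,
                    phi_ca: float, theta_ca: float) -> tuple:
    """Effective (false_positive, false_negative) rates of a two-stage
    approach, ASSUMING independent site and central errors."""
    # A true non-event is misclassified only if both assessors call it an event.
    false_positive = phi_sa * phi_ca
    # A true event is missed if the site assessor misses it (never sent on),
    # or if the site assessor flags it and the central assessor then misses it.
    false_negative = theta_sa + (1.0 - theta_sa) * theta_ca
    return false_positive, false_negative


PHI_CA = THETA_CA = 0.05
PHI_SA = 0.10
for theta_sa in (0.05, 0.10, 0.15, 0.20):
    fp, fn = two_stage_rates(PHI_SA, theta_sa, PHI_CA, THETA_CA)
    print(f"theta_SA={theta_sa:.2f}: two-stage FP={fp:.4f}, FN={fn:.4f}")
```

Under these assumptions the two-stage false-positive rate is very low (0.5%), but the false-negative rate can never fall below that of the site assessor: it rises from 9.75% to 24% as θ_SA increases from 5% to 20%, compared with 5% when the central assessor adjudicates all outcomes.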
Independent vs. group adjudication for central assessors
We explored the scenario where multiple central assessors will adjudicate the outcome, and evaluated two different approaches:

Central assessors adjudicate the outcome independently (independent adjudication).

Central assessors adjudicate the outcome together in a group (group adjudication).
For each method, we used three central assessors. Final classification of the outcome for independent adjudication was based on the majority. For group adjudication, we assumed an approach where assessors would discuss each outcome before voting, with the final classification being based on the majority.
We assumed that group adjudication could change the classification rate compared with independent adjudication (either increase or decrease it), and could increase the correlation between different assessors.
We set the false-positive and false-negative rates to 20%, and the ICC to 0.50 for independent adjudication. Under group adjudication, we varied the change in the false-positive and false-negative rates between −2%, −1%, 0%, 1% and 2%. A change of −2% indicates that using group adjudication led to a decrease of 2% for both the false-positive and false-negative rate compared with independent adjudication (i.e. the rates were reduced from 20% to 18%). We varied the change in ICC under group adjudication between 0, 0.1, 0.2, 0.3 and 0.4. A change of 0.4 indicates that using group adjudication led to an increase in the ICC of 0.4 compared with independent adjudication (i.e. the ICC was increased from 0.50 to 0.90).
This led to 5 × 5 = 25 total scenarios.
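The 25 group-adjudication scenarios can be enumerated directly from the baseline values and the two sets of changes. This is only a bookkeeping sketch; the variable names are hypothetical, and the changes are treated as absolute percentage-point shifts, as in the 20% to 18% example.

```python
from itertools import product

BASE_RATE = 0.20   # false-positive and false-negative rate, independent adjudication
BASE_ICC = 0.50    # ICC under independent adjudication
RATE_CHANGES = (-0.02, -0.01, 0.0, 0.01, 0.02)
ICC_CHANGES = (0.0, 0.1, 0.2, 0.3, 0.4)

# One dictionary per group-adjudication scenario (5 x 5 = 25 in total)
group_scenarios = [
    {"rate": round(BASE_RATE + d_rate, 2), "icc": round(BASE_ICC + d_icc, 2)}
    for d_rate, d_icc in product(RATE_CHANGES, ICC_CHANGES)
]

for scenario in group_scenarios:
    print(f"group rate={scenario['rate']:.2f}, group ICC={scenario['icc']:.2f}")
```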
Results
How many assessors should we use?
Results are shown in Figs. 4 and 5. Results were similar across all false-positive and false-negative rates. Increasing the number of assessors can reduce bias and increase power; however, the extent of this benefit depends on the true ICC: as the ICC increases, the size of the benefit is reduced. For example, for false-positive and false-negative rates of 20%, Fig. 5 shows that when the ICC is 0.10, increasing the number of assessors from one to three will increase the power from 40.1% to 55.9%. However, when the ICC is 0.90, going from one to three assessors only increases the power from 39.6% to 39.7%.
Furthermore, the benefit of adding extra assessors depends on how many assessors there are to start with. For example, for false-positive and false-negative rates of 20%, Fig. 5 shows that increasing the number of assessors from one to three (an increase of two assessors) leads to an increase in power of 15.8% for an ICC of 0.10; however, increasing the number of assessors from five to seven (also an increase of two assessors) leads to a much smaller increase in power of 3.7%.
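The diminishing return from each additional pair of assessors also appears in a simple binomial calculation. The sketch below assumes fully independent assessors (an ICC of zero, which is outside the simulated range) and reports the probability that the majority misclassifies an outcome, not the power values quoted above. The function name `majority_error_independent` is hypothetical.

```python
import math


def majority_error_independent(m: int, rate: float) -> float:
    """Probability that more than half of m INDEPENDENT assessors,
    each with misclassification probability `rate`, are wrong."""
    return sum(
        math.comb(m, k) * rate**k * (1 - rate) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )


errors = {m: majority_error_independent(m, 0.20) for m in (1, 3, 5, 7)}
gain_1_to_3 = errors[1] - errors[3]   # reduction from adding two assessors to one
gain_5_to_7 = errors[5] - errors[7]   # reduction from adding two assessors to five

for m, err in errors.items():
    print(f"m={m}: majority misclassification probability = {err:.4f}")
print(f"reduction 1 -> 3: {gain_1_to_3:.4f}; reduction 5 -> 7: {gain_5_to_7:.4f}")
```

With a 20% individual rate, the majority misclassification probability is 0.200, 0.104, 0.058 and 0.033 for one, three, five and seven independent assessors, so the first two extra assessors remove about 0.096 while the step from five to seven removes only about 0.025.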
Should we use site or central assessors?
Results are shown in Figs. 6, 7 and 8. Figure 6 shows results for the scenario where central assessors have lower false-positive and false-negative rates than site assessors. In this scenario, using one site assessor is the worst option. For lower ICCs, using three central assessors is the best option, followed closely by using one site assessor and two central assessors. For very high ICCs, there was little difference between any of the approaches involving central assessors.
Figure 7 shows results for the scenario where site assessors have lower false-positive and false-negative rates than central assessors. In this scenario, using one site assessor is generally the best approach, followed by using one site assessor and two central assessors. For very high ICCs, there was little difference between any of the approaches involving central assessors.
Figure 8 shows results for the scenario where central assessors have lower false-positive rates, and site assessors have lower false-negative rates. Using one site assessor and two central assessors was the best approach across all scenarios, followed by using three central assessors. For very high ICCs, there was little difference between any of the approaches.
Should central assessors adjudicate all outcomes, or only suspected events?
Results are shown in Fig. 9. The two-stage approach (having a central assessor only adjudicate suspected events) was only as good as having a central assessor adjudicate all events when the false-negative rate of the site assessor was very low. When the site assessor’s false-negative rate was high, the two-stage approach led to a large increase in bias compared with having a central assessor adjudicate all outcomes.
Independent vs. group adjudication for central assessors
Results are shown in Fig. 10. Using group adjudication (e.g. an endpoint review committee) was beneficial when it decreased the misclassification rates compared with individual adjudication. Small reductions in the misclassification rate could sometimes be outweighed by very large increases in correlation between assessors, but this was not an issue for larger reductions. Independent adjudication was preferable when group adjudication led to an increase in misclassification rates, or had no impact on misclassification rates but led to an increase in the correlation between assessors.
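The trade-off between a lower misclassification rate and a higher correlation can be illustrated with a closed-form calculation for three assessors. The sketch below assumes a beta-binomial model for correlated errors (mean equal to the misclassification rate, intraclass correlation equal to the ICC), under which the probability that at least two of three assessors err is 3E[p²] − 2E[p³]. This model is an assumption for illustration and is not necessarily the one used in the simulations; the quantity computed is a misclassification probability rather than bias or power, and the function name `majority_of_three_error` is hypothetical.

```python
def majority_of_three_error(rate: float, icc: float) -> float:
    """Probability that at least two of three assessors misclassify an outcome
    under an ASSUMED beta-binomial error model (0 < icc < 1)."""
    total = (1.0 - icc) / icc                        # a + b of the beta distribution
    a = rate * total
    second = rate**2 + icc * rate * (1.0 - rate)     # E[p^2]
    third = second * (a + 2.0) / (total + 2.0)       # E[p^3]
    return 3.0 * second - 2.0 * third


independent = majority_of_three_error(0.20, 0.50)    # baseline: 20% rate, ICC 0.50
group_small_gain = majority_of_three_error(0.19, 0.90)  # -1% rate, +0.4 ICC
group_large_gain = majority_of_three_error(0.18, 0.90)  # -2% rate, +0.4 ICC

print(f"independent:             {independent:.4f}")
print(f"group (-1%, ICC 0.90):   {group_small_gain:.4f}")
print(f"group (-2%, ICC 0.90):   {group_large_gain:.4f}")
```

Under this assumed model the baseline majority error is 0.184. A 1% reduction in the rate combined with an ICC of 0.90 gives about 0.189 (worse than independent adjudication), whereas a 2% reduction with the same ICC gives about 0.180 (better), which is qualitatively in line with the pattern described above.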
Choosing an adjudication approach for clinical trials in ulcerative colitis
As discussed previously, the key factors that will influence the choice of adjudication approach are (i) the comparative misclassification rates of the site and central assessors; (ii) the degree of correlation between assessors and (iii) the additional resource burden associated with central assessment.
Comparative misclassification rates between site and central assessors
For trials in ulcerative colitis, there is no gold standard method of adjudication, and it is therefore impossible to estimate the exact misclassification rates of the site and central assessors (as we have no way of determining what the ‘true’ outcome is). However, some published data on the rates of agreement between site and central assessors suggest that the two types of assessor disagree in up to 30% of cases [22, 23]. This is primarily due to site assessors adjudicating higher levels of disease severity than central assessors [22, 23]. It has been suggested that these rates of disagreement are primarily due to misclassification on the part of site assessors, which may be due to their knowledge of patient symptoms [22,23,24]. For example, if site assessors are aware that the patient is feeling ill, they may subconsciously assume that the endoscopy results must be poor, and may therefore provide a more severe adjudication of disease severity.
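A short calculation shows why an observed disagreement rate cannot, on its own, identify each assessor's misclassification rate. The sketch below treats the outcome as binary and assumes site and central assessors err independently; both are simplifications (endoscopic severity is graded, and the errors may be correlated), and the function names are hypothetical.

```python
import math


def disagreement_rate(site_error: float, central_error: float) -> float:
    """Probability that a site and a central assessor disagree on a binary
    outcome, ASSUMING independent errors: exactly one of them is wrong."""
    return site_error * (1 - central_error) + central_error * (1 - site_error)


def equal_error_for_disagreement(d: float) -> float:
    """Common error rate e solving 2 * e * (1 - e) = d (smaller root)."""
    return (1 - math.sqrt(1 - 2 * d)) / 2


# Two very different explanations of the same 30% disagreement rate
only_site_wrong = disagreement_rate(0.30, 0.0)
e = equal_error_for_disagreement(0.30)
both_equally_wrong = disagreement_rate(e, e)

print(f"site 30%, central 0%: disagreement = {only_site_wrong:.3f}")
print(f"both {e:.3f}: disagreement = {both_equally_wrong:.3f}")
```

Under these assumptions a 30% disagreement rate is produced equally well by a site assessor who is wrong 30% of the time alongside a perfect central assessor, or by two assessors who are each wrong about 18% of the time, so additional evidence is needed to attribute the disagreement to one type of assessor.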
Degree of correlation between assessors
Published data suggest that there is generally moderate-to-high correlation between central assessors (range 0.50–0.83) [22, 25].
Resource burden associated with central assessment
Central assessors would generally adjudicate an endoscopic video through a web portal. The resource burden associated with this is primarily driven by the fact that the endoscopy needs to be videoed at the enrolling clinical site, the video needs to be uploaded into a secure web portal that meets regulatory requirements, a coordinating image management organization must operationalise the review and distribution of images to central assessors, and the central assessors typically need to be paid for their time.
Recommendations
Based on the rate of disagreement between site and central assessors, the correlation between assessors, and the resource requirements for central assessment, several adjudication strategies could be proposed. If there is evidence to indicate that the disagreement between site and central assessors is primarily due to misclassification on the part of site assessors, then using only central assessors would be the preferred option. Because the estimated correlation between central assessors is high (>0.50), using more than three assessors would confer little benefit in terms of reducing bias; therefore, using one or three assessors is reasonable. If three assessors were to be used, we would recommend using independent (rather than group) adjudication, as group adjudication would lead to a large increase in resources, without necessarily improving the classification rate.
If we believe that the disagreement between site and central assessors is not primarily due to misclassification by the site assessors (e.g. if we believe it was due to misclassification by both site and central assessors) then using a combination approach (one site assessor and two central assessors) may be the preferred option. As before, we would use independent (rather than group) adjudication here.
Regardless of which adjudication approach is chosen, all assessors should be provided training if necessary, and it may be useful to have them adjudicate a number of test cases before the trial starts to ensure adjudication is performed in a consistent manner.
Discussion
Outcome misclassification can lead to biased treatment effect estimates and reduced power, potentially resulting in an erroneous conclusion regarding treatment efficacy. The implementation of strategies to reduce misclassification is therefore of critical importance in clinical trials. We found that the choice of adjudication approach can have a large impact on trial results, and should be given careful consideration in the planning stages of a clinical trial. We found that no one approach is optimal across all scenarios; instead, the best approach will depend on the specific trial characteristics, notably the misclassification rates of the site and central assessors, and the correlation between assessors. The resource implications of the different approaches are also worth considering.
Our conclusions are summarized in Table 3. Regardless of the adjudication approach chosen, using qualified assessors and providing sufficient training is likely to be of key importance in improving classification rates [3, 15]. For example, in clinical trials of ulcerative colitis, assessors could be asked to adjudicate a series of endoscopy videos from a training library prior to starting the trial, with formal statistical assessment of inter-rater agreement, to ensure adjudication is being done in an appropriate and consistent way. Throughout the trial, further assessments can be conducted to ensure agreement remains within a prespecified standard, with retraining of assessors who do not meet that standard. It is also worth noting that in some cases it may be necessary to also provide training for those who are compiling or documenting the evidence that is to be used by the assessor [3]. For example, for imaging outcomes, it may be useful to train those who take the image and to implement quality control processes to ensure that the images are acquired through standardized protocols and are of appropriate quality for adjudication.
Conclusion
Outcome adjudication is of critical importance to ensure validity of trial results, although no one approach is optimal across all settings. Investigators should choose the best strategy based on the specific characteristics of their trial. Regardless of the adjudication strategy chosen, assessors should be qualified and receive appropriate training.
Abbreviations
CA: central assessor
ICC: intraclass correlation coefficient
SA: site assessor
References
1. Bross I. Misclassification in 2 × 2 tables. Biometrics. 1954;10(4):478–86.
2. Kim MY, Goldberg JD. The effects of outcome misclassification and measurement error on the design and analysis of therapeutic equivalence trials. Stat Med. 2001;20(14):2065–78.
3. Beyer-Westendorf J, Halbritter K, Platzbecker H, Damme U, Neugebauer B, Kuhlisch E, et al. Central adjudication of venous ultrasound in VTE screening trials: reasons for failure. J Thromb Haemost. 2011;9(3):457–63.
4. Hata J, Arima H, Zoungas S, Fulcher G, Pollock C, Adams M, et al. Effects of the endpoint adjudication process on the results of a randomised controlled trial: the ADVANCE trial. PLoS One. 2013;8(2):e55807.
5. Heckbert SR, Kooperberg C, Safford MM, Psaty BM, Hsia J, McTiernan A, et al. Comparison of self-report, hospital discharge codes, and adjudication of cardiovascular events in the Women’s Health Initiative. Am J Epidemiol. 2004;160(12):1152–8.
6. Mahaffey KW, Harrington RA, Akkerhuis M, Kleiman NS, Berdan LG, Crenshaw BS, et al. Systematic adjudication of myocardial infarction endpoints in an international clinical trial. Curr Control Trials Cardiovasc Med. 2001;2(4):180–6.
7. Mahaffey KW, Roe MT, Dyke CK, Newby LK, Kleiman NS, Connolly P, et al. Misreporting of myocardial infarction end points: results of adjudication by a central clinical events committee in the PARAGON-B trial. Second Platelet IIb/IIIa Antagonist for the Reduction of Acute Coronary Syndrome Events in a Global Organization Network Trial. Am Heart J. 2002;143(2):242–8.
8. Naslund U, Grip L, Fischer-Hansen J, Gundersen T, Lehto S, Wallentin L. The impact of an endpoint committee in a large multicentre, randomized, placebo-controlled clinical trial: results with and without the endpoint committee’s final decision on endpoints. Eur Heart J. 1999;20(10):771–7.
9. Ninomiya T, Donnan G, Anderson N, Bladin C, Chambers B, Gordon G, et al. Effects of the end point adjudication process on the results of the Perindopril Protection Against Recurrent Stroke Study (PROGRESS). Stroke. 2009;40(6):2111–5.
10. Petersen JL, Haque G, Hellkamp AS, Flaker GC, Mark Estes 3rd NA, Marchlinski FE, et al. Comparing classifications of death in the Mode Selection Trial: agreement and disagreement among site investigators and a clinical events committee. Contemp Clin Trials. 2006;27(3):260–8.
11. Pogue J, Walter SD, Yusuf S. Evaluating the benefit of event adjudication of cardiovascular outcomes in large simple RCTs. Clin Trials. 2009;6(3):239–51.
12. Serebruany VL, Atar D. Viewpoint: central adjudication of myocardial infarction in outcome-driven clinical trials – common patterns in TRITON, RECORD, and PLATO? Thromb Haemost. 2012;108(3):412–4.
13. Vranckx P, McFadden E, Cutlip DE, Mehran R, Swart M, Kint PP, et al. Clinical endpoint adjudication in a contemporary all-comers coronary stent investigation: methodology and external validation. Contemp Clin Trials. 2013;34(1):53–9.
14. Walter SD, Cook DJ, Guyatt GH, King D, Troyan S. Outcome assessment for clinical trials: how many adjudicators do we need? Canadian Lung Oncology Group. Control Clin Trials. 1997;18(1):27–42.
15. Wilson JT, Slieker FJ, Legrand V, Murray G, Stocchetti N, Maas AI. Observer variation in the assessment of outcome in traumatic brain injury: experience from a multicenter, international randomized clinical trial. Neurosurgery. 2007;61(1):123–8. Discussion 8–9.
16. Schroeder KW, Tremaine WJ, Ilstrup DM. Coated oral 5-aminosalicylic acid therapy for mildly to moderately active ulcerative colitis. A randomized study. N Engl J Med. 1987;317(26):1625–9.
17. Eldridge SM, Ukoumunne OC, Carlin JB. The intracluster correlation coefficient in cluster randomized trials: a review of definitions. Int Stat Rev. 2009;77(3):378–94.
18. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: Wiley; 1981.
19. Granger CB, Vogel V, Cummings SR, Held P, Fiedorek F, Lawrence M, et al. Do we need to adjudicate major clinical events? Clin Trials. 2008;5(1):56–60.
20. Duffield C. The Delphi technique: a comparison of results obtained using two expert panels. Int J Nurs Stud. 1993;30(3):227–37.
21. Keeney S, Hasson F, McKenna H. Consulting the oracle: ten lessons from using the Delphi technique in nursing research. J Adv Nurs. 2006;53(2):205–12.
22. Feagan BG, Sandborn WJ, D’Haens G, Pola S, McDonald JW, Rutgeerts P, et al. The role of centralized reading of endoscopy in a randomized controlled trial of mesalamine for ulcerative colitis. Gastroenterology. 2013;145(1):149–57.e2.
23. Hebuterne X, Lemann M, Bouhnik Y, Dewit O, Dupas JL, Mross M, et al. Endoscopic improvement of mucosal lesions in patients with moderate to severe ileocolonic Crohn’s disease following treatment with certolizumab pegol. Gut. 2013;62(2):201–8.
24. Panes J, Feagan BG, Hussain F, Levesque BG, Travis SP. Central endoscopy reading in inflammatory bowel diseases. J Crohns Colitis. 2016;10 Suppl 2:S542–7.
25. Travis SP, Schnell D, Krzeski P, Abreu MT, Altman DG, Colombel JF, et al. Reliability and initial validation of the ulcerative colitis endoscopic index of severity. Gastroenterology. 2013;145(5):987–95.
Acknowledgements
Not applicable.
Funding
This project was not externally funded.
Availability of data and materials
Not applicable.
Authors’ contributions
Conception: BK, BF, VJ. Design: BK, VJ. Writing: BK, BF, VJ. All authors read and approved the final manuscript.
Authors’ information
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Keywords
 Randomized controlled trial
 Outcome adjudication
 Outcome assessment
 Endpoint adjudication committee
 Endpoint review committee
 Misclassification
 Central assessor