Both steps of enrollment contribute to the above-noted 9% population imbalance. In the first step, 4.5% more households were mapped in treatment villages than in control villages. This difference in mapping behavior between treatment and control groups is significant with \(p=0.0072\), despite supposed staff blinding.
In the second step, unblinded staff obtained consent from more and larger households, increasing the final imbalance in the number of households between treatment and control groups to 8%, and the imbalance in populations to 9%. Households were recorded absent 1.4× more frequently and having no eligible household member present 2.2× more frequently in the control arm.
To evaluate the impact of this bias, we tested many of the reported outcomes in the study for significance with paired Wilcoxon signed-rank tests at the village level. The only randomness these tests assume for validity is the independent random assignment of one village in each pair to treatment and the other to control. Alongside the effects of the trial’s intervention on rates of mask-wearing and physical distancing, the difference in consent obtained by unblinded staff is among the most significant differences between treatment and control for any outcome. Of the outcomes we tested with Wilcoxon tests, only these three had p-values less than \(10^{-6}\) (see Fig. 1 and Table 1). In the second step of enrollment, staff were tasked with both enrolling households and providing masks in the treatment villages, and hence were aware whether they were surveying a treatment or control village.
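As a minimal sketch of the kind of paired test described above, the following Python computes an exact two-sided Wilcoxon signed-rank p-value by enumerating every within-pair sign assignment. The village counts are invented toy numbers, not study data, and the function name `signed_rank_exact` is hypothetical. Full enumeration is only feasible for a handful of pairs; with hundreds of village pairs one would likely rely on a library routine such as `scipy.stats.wilcoxon` instead.

```python
from itertools import product


def signed_rank_exact(treatment, control):
    """Exact two-sided paired Wilcoxon signed-rank test (small n only)."""
    # Pairs with zero difference carry no sign information and are dropped.
    diffs = [t - c for t, c in zip(treatment, control) if t != c]
    n = len(diffs)

    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)

    # Under the null, the sign of each pair is an independent fair coin flip,
    # mirroring the random choice of treatment village within each pair.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy consented-household counts for 8 hypothetical village pairs.
toy_treatment = [105, 98, 120, 87, 110, 93, 101, 115]
toy_control = [100, 95, 111, 88, 104, 86, 99, 107]
w_plus, p_value = signed_rank_exact(toy_treatment, toy_control)
print(w_plus, p_value)
```

The enumeration makes the test's single assumption explicit: the null distribution comes only from flipping which village in each pair is labeled treatment.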
As described by the authors, this lack of staff blinding led to substantial post-randomization ascertainment bias (supplement, page 41 and Table S19). They suggest that the arm imbalance could be attributed to surveyors being more eager to enroll borderline households in the treatment villages and to households volunteering individuals younger than 18 so that they could receive masks. However, in their robustness analyses of the imbalance, they are only able to account for 25% of the difference in size between treatment and control. The statistically significant differences in the supposedly blinded mapping step suggest that some unintentional unblinding may have also occurred.
The findings are not robust to this strong effect on enrollment rates. Inferring causal effects in its presence requires assuming that borderline participants who would have been consented to participate in treatment but not control were just as likely as typical villagers to become infected with COVID, develop symptoms, and report them to study staff. More importantly, as blood draws were conditioned on symptom reporting and no bias-resistant endpoints were evaluated, the substantial, highly significant effects on staff and participant behavior should caution against confident causal claims about COVID-related outcomes.
The \(9\%\) imbalance observed in population sizes arose through some combination of bias and random chance. We ran permutation tests to illustrate the relative contribution of each of these factors to the imbalance. Figure 2 displays histograms generated by randomly reassigning treatment and control within village pairs 1 million times, yielding 1 million alternative splits between treatment and control groups. The red line shows the effect size for the actual treatment/control group split. We see that it would be exceptionally rare for the population imbalance to occur by random assignment of pairs to treatment and control. On the other hand, the observed symptomatic rates and symptomatic seropositivity rates are more plausibly explained by random fluctuation. That is, the null hypothesis that the intervention had no effect on the two outcomes based on subjective surveys is more consistent with the data than the null hypothesis that the intervention had no effect on the population imbalance. We would expect the symptomatic rates and symptomatic seropositivity rates, which are based on subjective surveys, to be more susceptible to staff and participant bias than demographic quantities like “counts of households” and “number of people in a village.” In other words, anyone inferring strong causal effects of the intervention on the COVID-related outcomes must reckon with the fact that the bias and randomization that imbalanced the enrolled population (which is definitely not a direct causal effect of mask filtration) should be at least as likely to induce similar imbalances in COVID-related outcomes.
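The within-pair reassignment described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the analysis code behind Figure 2: the village counts are invented, the function name `pair_flip_pvalue` is hypothetical, the statistic is the simple difference in arm totals, and the number of resamples is reduced from 1 million to 20,000 to keep the example fast.

```python
import random


def pair_flip_pvalue(pairs, reps=20_000, seed=0):
    """Monte Carlo two-sided p-value from random within-pair label swaps.

    `pairs` holds (treatment_count, control_count) for each village pair.
    """
    rng = random.Random(seed)
    diffs = [t - c for t, c in pairs]
    observed = abs(sum(diffs))

    hits = 0
    for _ in range(reps):
        # Swapping the labels within a pair flips the sign of its difference.
        stat = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(stat) >= observed:
            hits += 1
    # The +1 terms count the observed split as one of the resamples.
    return (hits + 1) / (reps + 1)


# Toy enrolled-population counts: treatment is larger in every pair.
toy_enrollment = [
    (530, 480), (610, 590), (455, 430), (720, 665), (390, 372), (505, 470),
    (640, 601), (580, 555), (415, 398), (700, 660), (560, 521), (475, 450),
]
# Toy symptomatic counts: the differences cancel out across pairs.
toy_symptomatic = [
    (40, 38), (35, 37), (50, 47), (44, 47), (31, 30), (29, 30),
    (52, 48), (41, 45), (38, 33), (36, 41), (47, 45), (43, 45),
]

p_enrollment = pair_flip_pvalue(toy_enrollment)
p_symptomatic = pair_flip_pvalue(toy_symptomatic)
print(p_enrollment, p_symptomatic)
```

In this toy setup the enrollment imbalance is almost never reproduced by relabeling, while the symptomatic split is matched or exceeded by every relabeling, which is the qualitative contrast the histograms are meant to convey.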
The authors provided similar permutation tests in Appendix Fig. S2 of their paper. Here, rather than simply counting the number of seropositives in each resampling, the authors re-run their fixed-effect regression to estimate the magnitude of the masking effect. They report the one-sided p-value of 0.07 for symptomatic seropositivity. The two-sided p-value associated with the authors’ permutation test is 0.14, which is aligned with our findings. We highlight this point to note that when we examine the same outcomes as the authors, the p-values in our reanalysis are not far from those reported in the original paper. However, we found other effects that should have been non-causal that were much more highly significant, suggesting that it is difficult to disentangle the effects of differences in staff-participant interaction between groups from the direct causal effects of masks.
Figure 3 illustrates the steps leading to the final 1086:1106 split in symptomatic seroprevalence between treatment and control groups. Each circle shows how much greater or lower the transition rate is in the treatment group than in the control group. The magnitude of these differences is striking: in behavioral outcomes, differences on the order of 10% were observed between the study arms. However, the same percentage of symptomatic individuals consented to blood draws in both arms. Additionally, in the final step, when blood samples are tested, there is no difference in the rate at which samples test positive for COVID-19 antibodies. It might seem surprising that the intervention’s impact on other behavioral mitigation measures such as social distancing also did not result in a clear impact on symptomatic seropositivity.