Many quality checks are feasible in the online space with large-sample A/B tests. Here are a few examples:
• Checks on randomization: If the experiment is designed for a one-to-one ratio (equally sized control and treatment), then deviations in the actual ratio of users likely indicate a problem, often called a sample ratio mismatch (SRM). With large numbers, a ratio smaller than 0.99 or larger than 1.01 for a design that called for 1.0 likely indicates a serious issue. This simple test has identified numerous issues in experiments, many of which initially looked either great or terrible and invoked Twyman’s law for us (“Any figure that looks interesting or different is usually wrong”).
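The ratio check above can be sketched as a chi-squared goodness-of-fit test on the observed split. This is a minimal illustration, not any particular platform's implementation; the function name, the example counts, and the alpha of 0.001 (a stricter cutoff is common for guardrail checks, since an SRM alert should be high-precision) are all assumptions for the sketch.

```python
from scipy import stats

def srm_check(n_control, n_treatment, expected_ratio=1.0, alpha=0.001):
    """Chi-squared test for a sample ratio mismatch (SRM).

    expected_ratio is treatment/control from the design (1.0 for 50/50).
    A tiny p-value means the observed split is very unlikely under the
    design, pointing at a bug (e.g. lossy logging or bot filtering
    applied to only one arm) rather than chance.
    """
    total = n_control + n_treatment
    expected_control = total / (1 + expected_ratio)
    expected_treatment = total - expected_control
    _, p = stats.chisquare([n_control, n_treatment],
                           [expected_control, expected_treatment])
    return p, p < alpha

# A 50.6% / 49.4% split: harmless-looking in percentage terms,
# but a clear SRM at a sample size of one million users.
p, is_srm = srm_check(506_000, 494_000)
```

Note how the check becomes more sensitive with scale: the same 0.6-point imbalance on a few thousand users would not trigger it.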
• Bias assessment with A/A tests: An A/A test is the same as an A/B test, except that treatment and control users receive an identical experience (the same UI, the same ranking algorithms, etc.), so differences measured by the experimental procedure reflect chance or bias. Because the null hypothesis is true by design in A/A tests, each metric should be statistically significant about 5% of the time when using a p-value cutoff of 0.05. We can easily run a large number of A/A tests, and a higher or lower failure rate for a metric indicates that the normality or i.i.d. (independent and identically distributed) assumptions are violated. A/A tests are also used to ensure reasonable balance between treatment and control users. They can be very effective at identifying biases, especially those introduced at the platform level. For example, we can use A/A tests to identify carry-over (or residual) effects, where previous experiments impact subsequent experiments run on the same users.
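The 5% expectation above is easy to verify by simulation. The sketch below runs many simulated A/A tests where both arms are drawn from the same (deliberately skewed) distribution and measures how often a t-test rejects at 0.05; the sample sizes, the exponential metric, and the function name are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def aa_false_positive_rate(n_tests=2000, n_users=1000, alpha=0.05):
    """Simulate A/A tests: both arms come from the same distribution.

    Under valid assumptions the rejection rate should be close to
    alpha; a materially higher or lower rate signals a violated
    assumption or a platform-level bias.
    """
    rejections = 0
    for _ in range(n_tests):
        control = rng.exponential(scale=1.0, size=n_users)    # skewed metric
        treatment = rng.exponential(scale=1.0, size=n_users)  # same distribution
        _, p = stats.ttest_ind(control, treatment)
        rejections += p < alpha
    return rejections / n_tests

rate = aa_false_positive_rate()  # should land near 0.05
```

On a real platform the same idea applies, but with actual user assignment and logged metrics instead of synthetic draws, so the simulation also exercises the assignment and analysis pipeline end to end.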
• Re-randomization or post-experiment adjustment: While randomization is a great technique for removing confounding factors, it is not always the most efficient. For example, we may have more engaged users in treatment than in control purely by chance. Stratification is a common technique for improving balance across strata, but it can be expensive to implement efficiently during the sampling phase. One effective approach is to check the balance of key metrics using historical data and re-randomize with a different hash ID if the difference between treatment and control is too large. For instance, Microsoft has built a ‘seed finder’ that tries hundreds of seeds for the hash function to find one that yields no statistically significant difference. Another approach is to apply an adjustment during the analysis phase, using post-stratification or CUPED. Netflix has a nice comparison paper on some of these approaches.
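The analysis-phase adjustment can be illustrated with CUPED's core formula: subtract from the in-experiment metric y its best linear predictor from a pre-experiment covariate x, using theta = cov(y, x) / var(x). This is a minimal sketch of the variance-reduction step only (the synthetic data and function name are assumptions); a full CUPED analysis would apply it per arm and then compare the adjusted means.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED variance reduction with a pre-experiment covariate x
    (typically the same metric measured before the experiment).

    theta = cov(y, x) / var(x); the adjusted metric
    y - theta * (x - mean(x)) keeps the same expectation as y but
    has lower variance whenever x is correlated with y.
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Synthetic example: post-period metric strongly correlated with
# its pre-period value, so the adjustment removes most of the noise.
rng = np.random.default_rng(42)
pre = rng.normal(10, 2, size=10_000)        # pre-experiment metric
post = pre + rng.normal(0, 1, size=10_000)  # in-experiment metric
adjusted = cuped_adjust(post, pre)
```

Because the correction term has mean zero by construction, the adjusted metric is an unbiased stand-in for the original; the payoff is a tighter confidence interval, or equivalently a smaller required sample size.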