Crisis, Replication and A/B Testing

The ‘Reproducibility Crisis’ is common knowledge in the scientific community. Most businesses, however, haven’t even begun to recognise the problem.

Richard J Brooker
Tech @ Careem
7 min read · May 4, 2020


Background

“Science is in a methodological crisis” — this seems to be a common sentiment among academics.

In a survey of 1,500 researchers, 70% of them said they had failed to replicate another scientist’s experiment [1]. In a 2009 study, 14% of scientists admitted to personally knowing someone who had falsified results [2]. A 2019 meta-study of deep learning in recommendation systems suggested that less than 40% of papers were replicable [3]. Countless other studies have gone on to confirm similar findings.

Survey of 1,576 researchers [https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970]

These findings have been deeply demoralising for researchers, especially those working in the social sciences and psychology, and they have weakened the standing of science in the public sphere.

What's not talked about is how remarkable this realisation is, and how far ahead the scientific community is in identifying these problems. Most of the private sector hasn’t even begun to recognise the issue or understand its relevance.

A lot of analytical work in business is haphazard and unsound. Too often, employees wield subtle statistical tools with only a superficial understanding of them. The result is a vast waste of time, energy, and money.

Introduction

In this article, I'm going to talk about some of the lessons we are learning from academic researchers, and highlight some of the things we can do to improve analysis, experimentation, and A/B testing in our companies.

1. Stop Incentivising Bad Practices

In many companies, your promotion, remuneration, or commission is directly or indirectly linked to the results of an experiment you are running. I understand the motivation for this; however, it incentivises false reporting and lazy analysis, and fails to reward exploration and risk-taking.

2. Education

Most business leaders don’t know what a significance test is or understand why statistical analysis is important.

Statistical analysis can drastically change your conclusions. For example, let's look at this revenue data for an e-commerce site.

> SELECT cohort, AVG(revenue), COUNT(customer_id) FROM revenue_table GROUP BY 1;
cohort,    AVG(revenue), COUNT(customer_id)
treatment, 14.01,        5723
control,   13.75,        5721

At face value, it would appear the treatment group had an increase in revenue.

However, after filtering outliers and plotting confidence intervals, you can see that, if anything, there is a negative impact for the majority of users.

Distribution plot of average revenue per user. Estimated using bootstrapping.

The variance of your metrics can be very high, especially if you are involved in both B2B and B2C sales.
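
To make this concrete, here is a minimal sketch, using made-up heavy-tailed revenue data and an arbitrary trimming threshold, of how you might compare the two cohorts with outliers removed and confidence intervals attached, rather than with raw averages alone:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up revenue data: heavy-tailed, as revenue metrics usually are.
control = rng.lognormal(mean=2.0, sigma=1.2, size=5721)
treatment = rng.lognormal(mean=2.0, sigma=1.2, size=5723)

def trimmed_mean_ci(x, trim_quantile=0.99):
    """Drop extreme outliers, then return the mean and a 95% t-interval."""
    x = x[x <= np.quantile(x, trim_quantile)]
    ci = stats.t.interval(0.95, df=len(x) - 1, loc=x.mean(), scale=stats.sem(x))
    return x.mean(), ci

for name, data in [("control", control), ("treatment", treatment)]:
    mean, (lo, hi) = trimmed_mean_ci(data)
    print(f"{name}: mean = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

Because both synthetic cohorts are drawn from the same distribution, the intervals should overlap heavily, which is exactly the situation a raw difference in averages hides.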

3. Peer Reviews

Pair programming, replicability studies, and cross-department validation can help you understand if your methods and findings are valid.

4. Understand P-hacking

P-hacking is a collection of ways you can distort your analysis by collecting or selecting data, or statistical analyses, until non-significant results become significant. Common examples are the early stopping of A/B tests and selective reporting.

A graph showing the z-score for 1000 A/B test simulations. The tests are terminated when the z-score reaches statistical significance. The result is a false-positive rate of 17%, three times higher than the intended 5%. [https://www.aarondefazio.com/tangentially/?p=83]
Evidence of selective reporting. It suggests there is a difference between p-values in the abstract compared to the p-values in the results section of research papers. [https://www.researchgate.net/figure/Evidence-for-p-hacking-across-scientific-disciplines-A-Evidence-for-p-hacking-from_fig8_273463561]
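
A minimal simulation of the early-stopping effect, in the spirit of the first chart above, might look like the following sketch (synthetic A/A data, hypothetical peeking schedule):

import numpy as np

rng = np.random.default_rng(42)

def peeking_aa_test(n_max=10_000, peek_every=100, z_crit=1.96):
    """Run an A/A test and peek at the z-score every `peek_every` observations.
    Returns True if we (wrongly) declare significance at any peek."""
    a = rng.binomial(1, 0.05, size=n_max)  # both arms share the same 5% rate
    b = rng.binomial(1, 0.05, size=n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        p_a, p_b = a[:n].mean(), b[:n].mean()
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se > 0 and abs(p_a - p_b) / se > z_crit:
            return True
    return False

false_positives = sum(peeking_aa_test() for _ in range(1000))
print(f"false-positive rate with peeking: {false_positives / 1000:.1%}")  # well above the nominal 5%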

For an entertaining discussion of p-hacking, check out Last Week Tonight with John Oliver.

5. Don’t Confuse your Experiment Unit and Observation unit

A common mistake people make when running an A/B test is splitting the test at the user level but then treating more granular data, such as sessions, as independent observations.

Take a look at this conversion data. Users have been placed into random control and treatment groups.

Conversion data for 430,311 sessions and 97,245 users.

Often, the analyst’s first instinct is to model this as independent Bernoulli trials.

Sample mean distribution of Bernoulli trials. P(control > treatment) = 93%.

However, because the split is at the user level, the data looks more like a power-law distribution.

Histogram of the number of conversions per user.

By treating sessions as i.i.d. we trick ourselves into thinking we have more data and less variance than we really do. The ‘real’ confidence intervals are much wider than our initial assumption had us believe.

Distribution plots using user bootstrapping, P(control > treatment) = 67%.

The results are no longer statistically significant (which is good, as this was a random A/A test with no difference between control and treatment).
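
As a rough sketch of the difference, here is an A/A simulation (all numbers made up) that compares the naive session-level conversion rate with a confidence interval built by resampling users, the actual experiment unit:

import numpy as np

rng = np.random.default_rng(7)

def simulate_users(n_users):
    """Each user gets a heavy-tailed session count and their own conversion rate."""
    sessions = rng.geometric(p=0.25, size=n_users)   # sessions per user
    rates = rng.beta(0.5, 10.0, size=n_users)        # per-user conversion rate
    conversions = rng.binomial(sessions, rates)
    return sessions, conversions

def user_bootstrap_ci(sessions, conversions, n_boot=2000):
    """Resample users (the experiment unit), not sessions, with replacement."""
    rates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(sessions), size=len(sessions))
        rates.append(conversions[idx].sum() / sessions[idx].sum())
    return np.percentile(rates, [2.5, 97.5])

s_c, c_c = simulate_users(20_000)  # control
s_t, c_t = simulate_users(20_000)  # treatment (no real effect: an A/A test)

print("session-level rates:", c_c.sum() / s_c.sum(), c_t.sum() / s_t.sum())
print("control 95% CI (user bootstrap):  ", user_bootstrap_ci(s_c, c_c))
print("treatment 95% CI (user bootstrap):", user_bootstrap_ci(s_t, c_t))

The user-level intervals are noticeably wider than anything you would get by pretending each session is an independent Bernoulli trial.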

6. Make Sure Your Test Has Enough Power

Whether you are doing a hypothesis test or calculating confidence intervals, you need to know what size of impact you are able to measure and how likely you are to detect it.
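
As an illustration, a power calculation for a conversion-rate test might look like this sketch, using statsmodels and entirely hypothetical numbers (a 5% baseline and a 0.5 percentage point lift):

import numpy as np
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical: baseline conversion of 5%, and we want to detect a lift to 5.5%.
baseline, target = 0.05, 0.055
effect_size = proportion_effectsize(baseline, target)

# Sample size per arm for a two-sided test at alpha = 0.05 and 80% power.
n_per_arm = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                         power=0.8, alternative="two-sided")
print(f"users needed per arm: {n_per_arm:,.0f}")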

7. Deduction vs Induction

Deductive reasoning moves from idea to observation, while induction moves from observation to idea. Both are important types of scientific reasoning.

Always think about your hypothesis and theory behind the observations you are seeing. Often having a valid theory is more important than the observations themselves.

8. Be Careful Calculating Rates

Ratio metrics can mislead when your treatment changes the denominator. For example, if your treatment increases the number of users you see, then you might see a drop in revenue per user even though total revenue has increased.

Try to look at sums and totals when possible.
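
A toy example with made-up numbers shows how the two views can disagree:

# Hypothetical numbers: the treatment attracts extra, lower-spending users.
control_users,   control_revenue   = 10_000, 140_000
treatment_users, treatment_revenue = 12_000, 150_000

print("revenue per user:",
      control_revenue / control_users,      # 14.0
      treatment_revenue / treatment_users)  # 12.5  -> looks like a drop
print("total revenue:   ", control_revenue, treatment_revenue)  # but the total is up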

9. Become a Bayesian

It's likely you are using Bayes' theorem and prior distributions already. But to really make use of the Bayesian paradigm, look into Thompson sampling.

PyData has a helpful introduction to Thompson Sampling https://www.youtube.com/watch?v=wcCSAbcj5Q0.
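
For a flavour of how it works, here is a minimal Beta-Bernoulli Thompson sampling sketch for two variants with made-up conversion rates:

import numpy as np

rng = np.random.default_rng(0)

true_rates = [0.05, 0.06]   # hypothetical conversion rates for variants A and B
alpha = np.ones(2)          # Beta(1, 1) priors for each variant
beta = np.ones(2)

for _ in range(10_000):
    # Sample a conversion rate for each variant from its posterior,
    # then serve the variant with the highest sample.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
print("traffic share:  ", (alpha + beta - 2) / (alpha + beta - 2).sum())

Traffic automatically shifts towards the better-performing variant as the posteriors sharpen, which is the practical appeal of the approach.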

10. Random is Rarely Enough

A common issue in testing is "Pre-Experiment Bias". This is when, by chance, your control and treatment have a pre-existing difference. This dramatically reduces the statistical power of your tests.

Common remedies are stratified sampling and propensity score matching.

However, the conversation around this topic is often ill-defined. Sometimes it's talked about as a solution to measuring significance; however, it shouldn't affect the validity of your significance test, only its power.
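
As a sketch of one remedy, stratified assignment on a pre-experiment covariate (here a hypothetical prior-spend column) keeps the two groups balanced from the start:

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical user table with a pre-experiment covariate (prior spend).
users = pd.DataFrame({
    "user_id": np.arange(20_000),
    "prior_spend": rng.lognormal(2.0, 1.0, size=20_000),
})
users["stratum"] = pd.qcut(users["prior_spend"], q=10, labels=False)

# Randomise *within* each stratum so both groups get a balanced mix of spenders.
users["group"] = (
    users.groupby("stratum")["user_id"]
         .transform(lambda s: rng.permutation(len(s)) % 2)
         .map({0: "control", 1: "treatment"})
)

print(users.groupby("group")["prior_spend"].mean())  # means should be very close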

11. Don’t Trust Blog Posts

There is a large industry around testing, particularly A/B testing: the global A/B testing software market is estimated to reach US$1,081M by 2025.

Claims-makers love to point out problems they can provide solutions to. Try to remember that what you are reading is a piece of self-promotion, not academic material.

Much of the language in online significance calculators is vague and doesn’t clarify what your observations are or how the test needs to be constructed.

12. Don’t Filter

As soon as you put a filter on your data (e.g. filtering out users who didn’t make it past a certain point in the conversion funnel), the test is no longer a randomised experiment.

13. Filter Outliers

Filtering outliers can help improve the power of your significance test. Keep in mind these outliers might need to be analyzed separately.

14. Multi Comparison

If you are doing multi-variant testing, you might need to correct your p-values (e.g. with a Bonferroni correction). The more variants you have, the higher the chance of a false positive.
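
A short sketch using statsmodels' multipletests with hypothetical p-values:

from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from testing five variants against a control.
p_values = [0.04, 0.30, 0.01, 0.20, 0.049]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}  adjusted p = {p_adj:.3f}  significant: {sig}")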

15. Be Aware of Network Effects

If you’re testing in a marketplace or social network, it is likely that the effect of your treatment will spill over into the control group. To deal with this, you will need network segmentation, a time-based test, or geo-spatial segmentation.

Lyft has a series of posts on the problem: https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e

16. Bootstrapping

Finding the right closed-form solutions for confidence intervals and hypothesis tests is nuanced and difficult. It can require years of experience and a deep understanding of statistics.

Bootstrapping is an easy and intuitive alternative for calculating a probability distribution and confidence intervals.

Example of calculating bootstrap distributions through sampling with replacement.
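
A minimal, general-purpose version (made-up revenue data, default settings chosen arbitrarily) is only a few lines:

import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=10_000, ci=95, seed=0):
    """Sample with replacement, apply `stat` to each resample,
    and read the confidence interval off the resulting distribution."""
    rng = np.random.default_rng(seed)
    stats_ = [stat(rng.choice(data, size=len(data), replace=True))
              for _ in range(n_boot)]
    lo, hi = np.percentile(stats_, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return lo, hi

# Example with made-up, heavy-tailed revenue data.
revenue = np.random.default_rng(1).lognormal(2.0, 1.2, size=5_000)
print("95% CI for mean revenue:", bootstrap_ci(revenue))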

17. Increase Transparency

Always share your code, data, and methodology.

Adopt a coding convention when sharing code. ‘Papers With Code’ provides a useful checklist for this, https://medium.com/paperswithcode/ml-code-completeness-checklist-e9127b168501.

18. Be Wary of Time Based Tests

Time-based switching is when you alternate versions of your app/website/etc. based on hourly windows.

The number of observations you have is the number of time windows, which is usually very small. You will need very large shifts in your metrics to see any statistical significance.

Results of a time-based A/A test simulation. It shows you would need a 2% increase in conversion rate to be sure of any statistical significance.
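
A rough sketch of such an A/A simulation (hypothetical parameters: a two-week test switching every hour, with 500 sessions per window) makes the small-sample problem visible; note that it ignores hour-of-day and day-of-week seasonality, which makes real data even noisier:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical: a two-week test, switching variants every hour.
n_windows = 14 * 24
conv_rate = 0.05
sessions_per_window = 500

# The conversion rate per window; the window, not the session, is the observation.
variant_a = rng.binomial(sessions_per_window, conv_rate,
                         size=n_windows // 2) / sessions_per_window
variant_b = rng.binomial(sessions_per_window, conv_rate,
                         size=n_windows // 2) / sessions_per_window

t, p = stats.ttest_ind(variant_a, variant_b)
print(f"windows per variant: {n_windows // 2}, p-value: {p:.2f}")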

19. Check Tracking

New app releases often come with improvements in tracking. This will affect which metrics you can compare.

Conclusion

This article is far from exhaustive. Each experiment is its own piece of analysis, with a hundred potential pitfalls each time.

Do not underestimate how difficult experimentation is.
