Researcher degrees of freedom


If you torture the data long enough, Nature will confess.

– Ronald Coase

The abundance of data and data manipulation tools has made it easier than ever to find interesting results in data. However, there is a danger that many of these results are spurious, because of the way the data has been analysed. This post looks at how this plays out in data science.

The problem in social science

In recent years, there have been concerted efforts to investigate a replication crisis in the social sciences. Concerns have been raised because many results (including some that have been in textbooks for decades) could not be replicated and have been called into question. The problem is so prevalent that, at the time of writing, the second most popular TED talk of all time is based on questionable research.

So how does such a situation come about?

One type of bias, known as p-hacking, occurs when data is selected and reanalysed in many different ways until you get the result you want. This issue has been known about since 1961, but it remains one of many data analysis pitfalls that are easy to fall into, because modern computing allows researchers to make multiple comparisons across datasets easily.

The funding of grants depends on a researcher’s publication record, which in turn depends on how many novel, statistically significant results they contribute. When there are incentives to produce significant results, there is an increased risk of (unconsciously) falling into the p-hacking trap.

The problem in data science

If researchers in academia who are incentivised to publish novel, statistically significant results are at risk of falling into these statistical traps, then it’s easy to see how data scientists in business (driven by similar incentives) might also be at risk.

Typically, an analyst will be asked to ‘slice and dice’ the data in a myriad of different ways until they find ‘actionable insights’. For example, if nothing interesting can be found at the overall level, they may split the data by other variables (e.g. age, gender, location) to see if there are any localised effects.

As an aside, the commonly used phrase ‘actionable insight’ is also problematic. For anything to be ‘actionable’, it must relate to something the organisation can control. For example, if a product is found to be especially popular with wealthy baby boomers, you can’t increase sales by increasing the number of wealthy baby boomers in the world. Furthermore, an ‘insight’ implies a clear understanding of something that wasn’t clear before. Changes in customer behaviour tend to be small effects, and detecting them requires large experimental sample sizes. Conversely, the types of effects that look dramatic on plots are usually things that are already obvious.

In our data ‘slicing and dicing’ scenario, the problem is that if the significance threshold is p < 0.05, then even if there really is no effect, each test still has a 5% chance of producing a ‘statistically significant’ result. If the analyst continues to re-analyse the data in different ways, they will eventually find a ‘statistically significant’ result by sheer luck.
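To put rough numbers on this, here is a minimal sketch. It assumes every re-analysis is an independent test at the 0.05 threshold, which is an idealisation: real re-analyses of the same dataset are correlated, so the true figures will differ. The function name is made up for illustration.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` tests is 'significant'
    at level `alpha` when there is no real effect in any of them.

    ASSUMPTION: the tests are independent of each other.
    """
    # Each test avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 14, 20, 50):
    chance = chance_of_false_positive(k)
    print(f"{k:>2} analyses -> {chance:.0%} chance of at least one false positive")
```

Under that independence assumption, 14 analyses are already enough to make a false positive more likely than not, and 20 analyses give roughly a 64% chance.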

The result is that the business ends up focussing on a spurious artefact of the data (e.g. customers in a particular socioeconomic segment, in a particular region, appear to favour a particular line of products). The problem is made worse by the fact that over-segmenting the data also creates smaller sample sizes, where it is easier to see spurious results.
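The slicing scenario can be simulated. The sketch below is purely illustrative: it generates synthetic ‘spend’ figures for a control and a treated group in each segment, drawn from the same distribution so that there is no real effect anywhere, and then tests every segment separately. The segment counts, spend figures, function names and the choice of a normal-approximation z-test are all assumptions made for the example.

```python
import math
import random
import statistics


def two_sided_p_value(group_a, group_b):
    """Approximate two-sample z-test p-value (normal approximation)."""
    standard_error = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    z = (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error
    # P(|Z| > |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


def slice_and_dice(num_segments, customers_per_arm, rng):
    """Return one p-value per segment for data with NO real effect.

    ASSUMPTION: spend is normally distributed (mean 100, sd 20) and is
    identical for both groups in every segment.
    """
    p_values = []
    for _ in range(num_segments):
        control = [rng.gauss(100, 20) for _ in range(customers_per_arm)]
        treated = [rng.gauss(100, 20) for _ in range(customers_per_arm)]
        p_values.append(two_sided_p_value(control, treated))
    return p_values


rng = random.Random(0)  # fixed seed so the run is repeatable
p_values = slice_and_dice(num_segments=40, customers_per_arm=50, rng=rng)
hits = [p for p in p_values if p < 0.05]
print(f"{len(hits)} of {len(p_values)} segments look 'significant' despite no real effect")
```

Averaged over many runs, about one segment in twenty should cross the threshold, even though every one of those ‘findings’ is noise by construction.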

Then there is the temptation to construct a post-hoc narrative to make sense of these results, one that would not have been considered had the result gone the other way. This has been referred to as HARKing (Hypothesizing After the Results are Known).

In summary, this scenario is unlikely to produce robust findings.

This FiveThirtyEight app shows how easy it is to p-hack your way to a statistically significant finding.

What to do about it

Significance testing was originally developed at a time before modern computers, and there is ongoing debate as to whether it deserves to be the default approach that it currently is. There have even been calls to abandon significance testing altogether. For data scientists, there is no simple method to guard against p-hacking, other than maintaining a healthy skepticism and being intellectually honest.

Finally, no discussion of p-hacking would be complete without a reference to the obligatory xkcd.