For this post, I wanted to go a little bit meta. One focus of this blog will be on whether results from academic software engineering are applicable to the work I do as a commercial software developer, so it was hard to pass up this Microsoft Research paper on representativeness of research.
In a nutshell, the problem is this: imagine that you read some study that shows, for example, that schedule slippage on a software project is significantly lessened if developers are given two digestive biscuits and a cup of tea at 4pm on working days. The study examined the breaktime habits of developers on 500 open source projects. This sounds quite convincing. If this thing with the tea and biscuits is true across five hundred projects, it must be applicable to my project, right?
That doesn’t follow, and there are many reasons why: the study may be biased; the analysis may be wrong; the biscuit break may correlate with success without causing it; the authors may have selected projects likely to demonstrate the biscuit outcome; projects that had initially signed up but got delayed might have dropped out. This paper evaluates one source of bias: the projects used in the study aren’t representative of the project you’re doing.
It’s a fallacy to assume that just because a study has a large sample size, its results can be generalised to the population. Generalisation only works when the sample is an even slice of the population. Imagine a world in which all software projects are written in either Java or LISP. Now it doesn’t matter whether I select 10 projects or 10,000 projects: a sample of LISP practices will not necessarily tell us anything about how to conduct a Java project.
Conversely, a study that investigates both Java and LISP projects can—in this restricted universe, and with the usual “all other things being equal” caveat—tell us something generally about software projects independent of language. However, choice of language is only one dimension along which software can be measured: size, activity, number of developers, licence and other factors can all be relevant. The phase space of important factors is therefore multidimensional.
In this paper the authors develop a framework, based on work in medicine and other fields, for measuring and maximising representativeness of a sample by appropriate selection of projects along the dimensions of the problem space. They apply the framework to recent research.
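To make the idea concrete, here is a toy sketch of a coverage-style representativeness score. This is not the paper’s exact metric, and the projects, dimensions and thresholds are invented for illustration: the idea is simply that a population project counts as “covered” if some sampled project is similar to it in every dimension, and the score is the fraction of the population covered.

```python
# Toy illustration of coverage-based representativeness (NOT the paper's
# exact metric). A population project is "covered" if some sampled project
# is within a tolerance of it in every dimension.

def covered(project, sample, thresholds):
    """True if any sampled project matches `project` within every dimension's tolerance."""
    return any(
        all(abs(project[dim] - s[dim]) <= tol for dim, tol in thresholds.items())
        for s in sample
    )

def coverage(population, sample, thresholds):
    """Fraction of the population covered by the sample."""
    return sum(covered(p, sample, thresholds) for p in population) / len(population)

# Hypothetical population: projects described by size (kLOC) and team size.
population = [
    {"kloc": 5,   "devs": 2},
    {"kloc": 8,   "devs": 3},
    {"kloc": 120, "devs": 40},
    {"kloc": 150, "devs": 55},
]
sample = [{"kloc": 6, "devs": 2}]       # only a small project was sampled
thresholds = {"kloc": 10, "devs": 5}    # "similar" = within these tolerances

print(coverage(population, sample, thresholds))  # 0.5: the large projects go uncovered
```

Even this toy version shows why sample size alone is not enough: adding more small projects to the sample would never raise the score, while a single well-chosen large project would.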
What they discovered, tabulated in Table II of the paper, is that while a very small, carefully-selected sample can be surprisingly representative (50 out of ~20k projects represented ~15% of their problem space), the ~200 projects they could find analysed in recent research only scored around 9% on their representativeness metric. In certain dimensions, however, the studies were highly representative, many covering 100% of the phase space along a specific dimension.
A fact that jumped out at me, because of the field I work in, is that there are 245 Objective-C projects in the universe studied by this paper (the projects indexed on Ohloh) and that not one of these is covered by any of the studies they analysed. That could mean that my own back yard is ripe for the picking: that there are interesting results to be determined by analysing Objective-C projects and comparing those results with received wisdom.
In the discussion section, the authors point out that just because a study is not general, does not mean it is not useful. You may not be able to generalise a result gleaned from analysing (say) Java developer tools to all software development, but if what you’re interested in is Java developer tools then that is not a problem.
What this paper gives us, then, is not necessarily a tool that commercial developers can run out and use. It gives us some important quantitative context on evaluating the research that we do read. And, should we want to analyse our own work and investigate hypotheses about factors affecting our projects, it gives us a framework to understand just how representative those analyses would be.