Software, Science?

Is there any science in software making? Does it make sense to think of software making as scientific? Would it help if we could?

Hold on, just what is science anyway?

Good question. The medieval French philosopher-monk Buridan said that the source of all knowledge is experience, and Richard Feynman paraphrased this as “the test of all knowledge is experiment”.

If we accept that science involves some experimental test of knowledge, then some of our friends who think of themselves as scientists will find themselves excluded. Consider the astrophysicists and cosmologists. It’s all very well having a theory about how stars form, but if you’re not willing to swirl a big cloud of hydrogen, helium, lithium, carbon etc. about and watch what happens to it for a few billion years then you’re not doing the experiment. If you’re not doing the experiment then you’re not a scientist.

Our friends in what are called the biological and medical sciences also have some difficulty now. A lot of what they do is tested by experiment, but some of the experiments are not permitted on ethical grounds. If you’re not allowed to do the experiment, maybe you’re not a real scientist.

Another formulation (OK, I got this from the wikipedia entry on Science) sees science as a sort of systematic storytelling: the making of “testable explanations and predictions about the universe”.

Under this definition, there’s no problem with calling astronomy a science: you think this is how things work, then you sit, and watch, and see whether that happens.

Of course a lot of other work fits into the category now, too. There’s no problem with seeing the “social sciences” as branches of science: if you can explain how people work, and make predictions that can (in principle, even if not in practice) be tested, then you’re doing science. Psychology, sociology, economics: these are all sciences now.

Speaking of the social sciences, we must remember that science itself is a social activity, and that the way it’s performed is really defined as the explicit and implicit rules and boundaries created by all the people who are doing it. As an example, consider falsificationism. This is the idea that a good scientific hypothesis is one that can be rejected, rather than confirmed, by an appropriately-designed experiment.

Sounds pretty good, right? Here’s the interesting thing: it’s also pretty new. It was mostly popularised by Karl Popper in the 20th Century. So if falsificationism is the hallmark of good science, then Einstein didn’t do good science, nor did Marie Curie, nor Galileo, or a whole load of other people who didn’t share the philosophy. Just like Dante’s Virgil was not permitted into heaven because he’d been born before Christ and therefore could not be a Christian, so all of the good souls of classical science are not permitted to be scientists because they did not benefit from Popper’s good message.

So what is science today is not necessarily science tomorrow, and there’s a sort of self-perpetuation of a culture of science that defines what it is. And of course that culture can be critiqued. Why is peer review appropriate? Why do the benefits outweigh the politics, the gazumping, the gender bias? Why should it be that if falsification is important, negative results are less likely to be published?

Let’s talk about Physics

Around a decade ago I was studying particle physics pretty hard. Now there are plenty of interesting aspects to particle physics. Firstly that it’s a statistics-heavy discipline, and that results in statistics are defined by how happy you are with them, not by some binary right/wrong criterion.

It turns out that particle physicists are a pretty conservative bunch. They’ll only accept a particle as “discovered” if the signal indicating its existence is measured as a five-sigma confidence: i.e. if there’s under a one-on-a-million chance that the signal arose randomly in the absence of the particle’s existence. Why five sigma? Why not three (a 99.7% confidence) or six (to keep Motorola happy)? Why not repeat it three times and call it good, like we did in middle school science classes?

Also, it’s quite a specialised discipline, with a clear split between theory and practice and other divisions inside those fields. It’s been a long time since you could be a general particle physicist, and even longer since you could be simply a “physicist”. The split leads to some interesting questions about the definition of science again: if you make a prediction which you know can’t be verified during your lifetime due to the lag between theory and experimental capability, are you still doing science? Does it matter whether you are or not? Is the science in the theory (the verifiable, or falsifiable, prediction) or in the experiment? Or in both?

And how about Psychology, too

Physicists are among the most rational people I’ve worked with, but psychologists up the game by bringing their own feature to the mix: hypercriticality. And I mean that in the technical sense of criticism, not in the programmer “you’re grammar sucks” sense.

You see, psychology is hard, because people are messy. Physics is easy: the apple either fell to earth or it didn’t. Granted, quantum gets a bit weird, but it generally (probably) does its repeatable thing. We saw above that particle physics is based on statistics (as is semiconductor physics, as it happens); but you can still say that you expect some particular outcome or distribution of outcomes with some level of confidence. People aren’t so friendly. I mean, they’re friendly, but not in a scientific sense. You can do a nice repeatable psychology experiment in the lab, but only by removing so many confounding variables that it’s doubtful the results would carry over into the real world. And the experiment only really told you anything about local first year psychology undergraduates, because they’re the only people who:

walked past the sign in the psychology department advertising the need for participants;
need the ten dollars on offer for participation desperately enough to turn up.

In fact, you only really know about how local first year psychology undergraduates who know they’re participating in a psychology experiment behave. The ethics rules require informed consent which is a good thing because otherwise it’s hard to tell the difference between a psychology lab and a Channel 4 game show. But it means you have to say things like “hey this is totally an experiment and there’ll be counselling afterward if you’re disturbed by not really electrocuting the fake person behind the wall” which might affect how people react, except we don’t really know because we’re not allowed to do that experiment.

On the other hand, you can make observations about the real world, and draw conclusions from them, but it’s hard to know what caused what you saw because there are so many things going on. It’s pretty hard to re-run the entire of a society with just one thing changed, like “maybe if we just made Hitler an inch taller then Americans would like him, or perhaps try the exact same thing as prohibition again but in Indonesia” and other ideas that belong in Philip K. Dick novels.

So there’s this tension: repeatable results that might not apply to the real world (a lack of “ecological validity”), and real-world phenomena that might not be possible to explain (a lack of “internal validity”). And then there are all sorts of other problems too, so that psychologists know that for a study to hold water they need to surround what they say with caveats and limitations. Thus is born the “threats to validity” section on any paper, where the authors themselves describe the generality (or otherwise) of their results, knowing that such threats will be a hot topic of discussion.

But all of this—the physics, the psychology, and the other sciences—is basically a systematised story-telling exercise, in which the story is “this is why the universe is as it is” and the system is the collection of (time-and-space-dependent) rules that govern what stories may be told. It’s like religion, but with more maths (unless your religion is one of those ones that assigns numbers to each letter in a really long book then notices that your phone number appears about as many times as a Poisson distribution would suggest).

Wait, I think you were talking about software

Oh yeah, thanks. So, what science, if any, is there in making software? Does there exist a systematic approach to storytelling? First, let’s look at the kinds of stories we need to tell.

The first are the stories about the social system in which the software finds itself: the story of the users, their need (or otherwise) for a software system, their reactions (or otherwise) to the system introduced, how their interactions with each other change as a result of introducing the system, and so on. Call this requirements engineering, or human-computer interaction, or user experience; it’s one collection of stories.

You can see these kinds of stories emerging from the work of Manny Lehman. He identifies three types of software:

an S-system is exactly specified.
a P-system executes some known procedure.
an E-system must evolve to meet the needs of its environment.

It may seem that E-type software is the type in which our stories about society are relevant, but think again: why do we need software to match a specification, or to follow a procedure? Because automating that specification or procedure is of value to someone. Why, or to what end? Why that procedure? What is the impact of automating it? We’re back to telling stories about society. All of these software systems, according to Lehman, arise from discovery of a problem in the universe of discourse, and provide a solution that is of possible interest in the universe of discourse.

The second kind are the stories about how we worked together to build the software we thought was needed in the first stories. The practices we use to design, build and test our software are all tools for facilitating the interaction between the people who work together to make the things that come out. The things we learned about our own society, and that we hope we can repeat (or avoid) in the future, become our design, architecture, development, testing, deployment, maintenance and support practices. We even create our own tools—software for software’s sake—to automate, ease or disrupt our own interactions with each other.

You’ll mostly hear the second kind of story at most developer conferences. I believe that’s because the people who have most time and inclination to speak at most developer conferences are consultants, and they understand the second stories to a greater extent than the first because they don’t spend too long solving any particular problem. It’s also because most developer conferences are generally about making software, not about whatever problem it is that each of the attendees is trying to solve out in the world.

I’m going to borrow a convention that Rob Rix told me in an email, of labelling the first type of story as being about “external quality” and the second type about “internal quality”. I went through a few stages of acceptance of this taxonomy:

Sounds like a great idea! There really are two different things we have to worry about.
Hold on, this doesn’t sounds like such a good thing. Are we really dividing our work into things we do for “us” and things we do for “them”? Labelling the non-technical identity? That sounds like a recipe for outgroup homogeneity effect.
No, wait, I’m thinking about it wrong. The people who make software are not the in-group. They are the mediators: it’s just the computers and the tools on one side of the boundary, and all of society on the other side. We have, in effect, the Janus Thinker: looking on the one hand toward the external stories, on the other toward the internal stories, and providing a portal allowing flow between the two.

So, um, science?

What we’re actually looking at is a potential social science: there are internal stories about our interactions with each other and external stories about our interactions with society and of society’s interactions with the things we create, and those stories could potentially be systematised and we’d have a social science of sorts.

Particularly, I want to make the point that we don’t have a clinical science, an analogy drawn by those who investigate evidence-based software engineering (which has included me, in my armchair way, in the past). You can usefully give half of your patients a medicine and half a placebo, then measure survival or recovery rates after that intervention. You cannot meaningfully treat a software practice, like TDD as an example, as a clinical intervention. How do you give half of your participants a placebo TDD? How much training will you give your ‘treatment’ group, and how will you organise placebo training for the ‘control’ group? [Actually I think I’ve been on some placebo training courses.]

In constructing our own scientific stories about the world of making software, we would run into the same problems that social scientists do in finding useful compromises between internal and ecological validity. For example, the oft-cited Exploratory experimental studies comparing online and offline programming performance (by Sackman et al., 1968) is frequently used to support the notion that there are “10x programmers”, that some people who write software just do it ten times faster than others.

However, this study does not have much ecological validity. It measures debugging performance, using either an offline process (submitting jobs to a batch system) or an online debugger called TSS, which probably isn’t a lot like the tools used in debugging today. The problems were well-specified, thus removing many of the real problems programmers face in designing software. Participants were expected to code a complete solution with no compiler errors, then debug it: not all programmers work like that. And where did they get their participants from? Did they have a diverse range of backgrounds, cultures, education, experience? It does not seem that any results from that study could necessarily apply to modern software development situated in a modern environment, nor could the claim of “10x programmers” necessarily generalise as we don’t know who is 10x better than whom, even at this one restricted task.

In fact, I’m also not convinced of its internal validity. There were four conditions (two programming problems and two debugging setups), each of which was assigned to six participants. Variance is so large that most of the variables are independent of each other (the independent variables are the programming problem and the debugging mode, and the dependent variables are the amount of person-time and CPU-time), unless the authors correlate them with “programming skill”. How is this skill defined? How is it measured? Why, when the individual scores are compared, is “programming skill” not again taken into consideration? What confounding variables might also affect the wide variation in scored reported? Is it possible that the fastest programmers had simply seen the problem and solved it before? We don’t know. What we do know is that the reported 28:1 ratio between best and worst performers is across both online and offline conditions (as pointed out in, e.g., The Leprechauns of Software Engineering, so that’s definitely a confounding factor. If we just looked at two programmers using the same environment, what difference would be found?

We had the problem that “programming skill” is not well-defined when examining the Sackman et al. study, and we’ll find that this problem is one we need to overcome more generally before we can make the “testable explanations and predictions” that we seek. Let’s revisit the TDD example from earlier: my hypothesis is that a team that adopts the test-driven development practice will be more productive some time later (we’ll defer a discussion of how long) than the null condition.

OK, so what do we mean by “productive”? Lines of code delivered? Probably not, their amount varies with representation. OK, number of machine instructions delivered? Not all of those would be considered useful. Amount of ‘customer value’? What does the customer consider valuable, and how do we ensure a fair measurement of that across the two conditions? Do bugs count as a reduction in value, or a need to do additional work? Or both? How long do we wait for a bug to not be found before we declare that it doesn’t exist? How is that discovery done? Does the expense related to finding bugs stay the same in both cases, or is that a confounding variable? Is the cost associated with finding bugs counted against the value delivered? And so on.

Software dogma

Because our stories are not currently very testable, many of them arise from a dogmatic belief that some tool, or process, or mode of thought, is superior to the alternatives, and that there can be no critical debate. For example, from the Clean Coder:

The bottom line is that TDD works, and everybody needs to get over it.

No room for alternatives or improvement, just get over it. If you’re having trouble defending it, apply a liberal sprinkle of argumentum ab auctoritate and tell everyone: Robert C. Martin says you should get over it!

You’ll also find any number of applications of the thought-terminating cliché, a rhetorical technique used to stop cognitive dissonance by allowing one side of the issue to go unchallenged. Some examples:

“I just use the right tool for the job”—OK, I’m done defending this tool in the face of evidence. It’s just clearly the correct one. You may go about your business. Move along.
“My approach is pragmatic”—It may look like I’m doing the opposite of what I told you earlier, but that’s because I always do the best thing to do, so I don’t need to explain the gap.
“I’m passionate about [X]”—yeah, I mean your argument might look valid, I suppose, if you’re the kind of person who doesn’t love this stuff as much I do. Real craftsmen would get what I’m saying.
and more.

The good news is that out of such religious foundations spring the shoots of scientific thought, as people seek to find a solid justification for their dogma. So just as physics has strong spiritual connections, with Steven Hawking concluding in A Brief History of Time:

However, if we discover a complete theory, it should in time be understandable by everyone, not just by a few scientists. Then we shall all, philosophers, scientists and just ordinary people, be able to take part in the discussion of the question of why it is that we and the universe exist. If we find the answer to that, it would be the ultimate triumph of human reason — for then we should know the mind of God.

and Einstein debating whether quantum physics represented a kind of deific Dungeons and Dragons:

[…] an inner voice tells me that it is not yet the real thing. The theory says a lot, but does not really bring us any closer to the secret of the “old one.” I, at any rate, am convinced that He does not throw dice.

so a (social) science of software could arise as an analogous form of experimental theology. I think the analogy could be drawn too far: the context is not similar enough to the ages of Islamic Science or of the Enlightenment to claim that similar shifts to those would occur. You already need a fairly stable base of rational science (and its application via engineering) to even have a computer at all upon which to run software, so there’s a larger base of scientific practice and philosophy to draw on.

It’s useful, though, when talking to a programmer who presents themselves as hyper-rational, to remember to dig in and to see just where the emotions, fallacious arguments and dogmatic reasoning are presenting themselves, and to wonder about what would have to change to turn any such discussion into a verifiable prediction. And, of course, to think about whether that would necessarily be a beneficial change. Just as we can critique scientific culture, so should we critique software culture.

OOP the Easy Way

APPropriate Behaviour

APPosite Concerns

FSF