Did that restructuring work actually help?

Before getting into the meat of this post, I’d like to get into the meta of this post. This essay, and I imagine many in this blog [Ed: by which I meant the blog this has been imported from], will be treading a fine line. The intended aim is to question accepted industry practice, and find results consistent or inconsistent with the practice as a beneficial task to perform. I’m more likely to select papers that appear to refute the practice, as that’s more interesting and makes us introspect the way we work more than does affirmation. The danger is that this skates too close to iconoclasm, as expressed in the Goto Copenhagen talk title Is it just me or is everything shit?. My intention isn’t to say that whatever we’re doing is wrong, just to provide some healthy inspection and analysis of our industry.

Legacy Software Restructuring: Analyzing a Concrete Case

The thread in this paper is that metrics that have long been used to measure the quality of source code—metrics related to coupling and cohesion—may not actually be relevant to the problems developers have to solve. Firstly, the jargon:

coupling refers to the connections between the part of the software (module, class, function, whatever) under consideration and the rest of the software system. Received wisdom is that lower coupling (i.e. fewer connections that are less-tightly intertwined) is better.
cohesion refers to the relatedness of the tasks performed by the (module, class, function, whatever) under consideration. The more different responsibilities a component provides or uses, the lower its cohesion. Received wisdom is that higher cohesion (i.e. fewer responsibilities per module) is better.

We’re told that striving for low coupling and high cohesion will make the parts of our software reusable and replaceable, and will reduce the number of code sites we need to change when we want to fix bugs in the future. The focus of this paper is on whether the metrics we use as proxies for these properties actually represent enhancements to the code; in other words, whether we have a systematic way to decide whether a change is an improvement or not.

Approach

The way in which the authors test their metrics is necessarily problematic. There is no objective standard against which they can be prepared—if there were, we’d have an objective standard and we could all go home. They hypothesise that any restructuring effort by a development team must represent an improvement in the codebase: if you didn’t think a change was better, why would you make that change?

Necessity is one such reason. Consider the following thought process:

I need to add this feature to my product, this change was unforeseen at design time, so the architecture doesn’t really support it. I’m not very happy about the this, but shoehorning it in here is the simplest way to support what I need.

To understand other problems with this methodology, a one-paragraph introduction to the postmodern philosophy of software engineering is required. Software, it says, supports not some absolute set of requirements that were derived from studying the universe, but the ad-hoc set of interactions between the various people who engage with the software system. Indeed, the software system itself modifies these interactions, creating a feedback loop that in fact modifies the requirements of the software that was created. Some of the results of this philosophy[*] are expressed in “Manny” Lehman’s Laws of Software Engineering, which are also cited by the paper I’m talking about here. The authors offer one of Lehman’s Laws as:

[*] I don’t consider Lehman’s laws to be objectively true of software artefacts, but to be hypotheses that arise from a particular philosophy of software. I also think that philosophy has value.

Considering Lehman’s law of software evolution, such systems would already have suffered a decrease in their quality due to the maintenance. This would increase the probability that the restructuring has a better modular quality.

This statement is inconsistent. On the one hand, this change improves the quality of some software. On the other hand, the result of a collection of such changes is to decrease the quality. Now there’s nothing to say that a particular change won’t be an improvement; but there’s also nothing to say that the observed change has this property.[*] The postmodern philosophy adds an additional wrinkle: even if this change is better, it’s only better _as perceived by the people currently working with the system_. Others may have different ideas. We saw, in discussing the teaching of programming, that even experienced programmers can have difficulty reading somebody else’s code. I wouldn’t find it a big stretch to posit that different people have different ideas of what constitutes “good” modular decomposition, and that therefore a different set of programmers would think this change to be worse.

[*]Actually I think the sentence in the paper might just be broken; remember that I found this on the preprint server so it might not have been reviewed yet. One of Lehman’s laws says that, for “E-type” software (by which he means systems that evolve with their environment—in other words, systems where a postmodern appraisal is applicable), the software will gradually be perceived as _reducing_ in quality if no maintenance work goes into it. That’s because the system is evolving while the software isn’t; the requirements change without the software catching up.

Results and Discussion

The authors found that, for three particular revisions of Eclipse, the common metrics for coupling and cohesion did not monotonically “improve” with successive restructuring efforts. In some cases, both coupling and cohesion decreased in the same effort. In addition, they found that the number and extent of cyclic dependencies between Java packages increased with every successive version of the platform.

It’s not really possible to choose a conclusion to draw from these results:

maybe the dogma of increasing cohesion and decreasing coupling is misleading.
maybe the metrics used to measure those properties were poorly chosen (though they are commonly-chosen).
maybe the Eclipse developers use some other measurement of quality that the authors didn’t ask about.
maybe some of the Eclipse engineers do take these properties into account, and some others don’t, and we [can’t – added on import] even draw general conclusions about Eclipse.

So this paper doesn’t demonstrate that cohesion and coupling metrics are wrong. But it does raise the important question: might they be not right? If you’re relying on some code metrics derived from received wisdom or dogma, it’s time to question whether they really apply to what you do.

Legacy Software Restructuring: Analyzing a Concrete Case

Approach

Results and Discussion

About Graham

OOP the Easy Way

APPropriate Behaviour

APPosite Concerns

FSF