Structure and Interpretation of Computer Programmers

I make it easier and faster for you to write high-quality software.

Thursday, January 14, 2021

Data curation during a pandemic

Here’s what I’ve been working on (with others, of course) since February.

posted by Graham at 18:34  

Tuesday, July 9, 2019

The challenges of teaching software engineering

I’ve just finished teaching a four-day course introducing software engineering for the first time. My plan is to refine the course (I’m teaching it again in October), and it will eventually become the basis for doctoral training programmes in research software engineering at Oxford, and part of a taught Masters. My department already has an M.Sc. in Software Engineering for commercial engineers (in fact I have that degree), and we want to do the same for software engineers in research context.

Of course, I can also teach your team about software engineering!

Some challenges that came up:

  • I’m too comfortable with the command-line to get people past the initial unfamiliar discomfort. From that perspective, command-line tools are all unusably hard. I’ve learnt from various sources to try foo --help, man foo, and other incantations. Others haven’t.

  • git, in particular, is decidedly unfriendly. What I want to do is commit my changes. What I have to do is stage my changes, then commit my staged changes. As a result, teaching git use takes a significant chunk of the available time, and still leaves confusion.

  • you need to either tell people how to set their core.editor, or how to quit vim.

  • similarly, there’s a world of difference between python foo.py and python3 foo.py, and students aren’t going to interpret the sorts of errors you et if you choose the wrong one.

  • Introduce a tangent, and I run the risk of losing people to that tangent. I briefly mentioned UML while discussing diagrams of objects, as a particular syntax for those diagrams. In the subsequent lab, some people put significant time into making sure their diagrams were valid UML.

  • Finding the trade-off between presentation, tutorial, and self-directed exercise is difficult. I’m used to presentations and will happily talk on many topics, but even I get bored of listening to me after the ~50% of the time I’ve spent speaking on this course. It must be worse for the students. And there’s no substitute for practical experience, but that must be supported by guidance.

  • There are so many topics that I didn’t get to cover!

    • only having an hour for OOP is a sin
    • which means I didn’t even mention patterns or principles
    • similarly, other design techniques like functional programming got left off
    • principles like Agile Software Development, Software Craftsmanship, or Devops don’t get a mention
    • continuous integration and continuous delivery got left off. Even if they didn’t, the amount of work involved in going from “I have a Python script” to “I run my tests whenever I change my script, and update my PYpi package whenever they pass” is too damn high.
    • forget databases, web servers, browsers, mobile apps, desktop apps, IoT, or anything that isn’t a command line script or a jupyter notebook
    • and machine learning tools
    • and concurrency, processes and process improvement, risk management, security, team dynamics, user experience, accessibility…

It’s only supposed to be a taster but I have to trade off introducing everything with showing the value present in anything. What this shows, as I found when I wrote APPropriate Behaviour, is that there’s a load that goes into being a programmer that is not programming.

posted by Graham at 16:13  

Thursday, May 16, 2019

Deprecating yarn

In which I help Oxford University CS department with their threading issues.

posted by Graham at 14:16  

Monday, May 13, 2019

Oxford University course on collaborative coding

Niche-audience topic time: if you’re in Oxford Uni, I’m giving a one-day course on collaborative software engineering with git and GitHub (the ideas apply to GitLab, Bitbucket etc. too) on 4th June, 10-3 at the Maths Institute. Look out for information from the OxfordRSE group with a sign-up link!

posted by Graham at 15:23  

Friday, March 29, 2019

Runtime verification in Erlang by using contracts

About this paper

Runtime verification in Erlang by using contracts, L.-A. Fredlund et al, presented at WFLP 2018.

Notes

Spoiler alert, but the conclusion to my book OOP the Easy Way is that we should have independently-running objects, like we do in Erlang. We should also document the contract between an object and its collaborators, like we do in Eiffel. The concurrency is there because processors are getting wider, not faster, and applications are becoming more distributed across IOT, “edge compute”, and servers. The contracts are there because we want to tame that complexity, tell people how each of the bits of our system can be composed, and what to expect if they try that. And we want to verify that documentation automatically, keeping ourselves honest.

As the authors here note, Erlang already has good facilities for recovering from errors, but that’s only really suitable for errors that are “things that might go wrong”. Programmer mistakes – call them bugs, logic errors, whatever – are things that shouldn’t go wrong, and Erlang doesn’t help there. Eiffel has really good tools for dealing with things that shouldn’t go wrong.

This paper introduces Erlang Design by Contract, which adds the relevant Eiffel bits to Erlang. At its core is a contract specification that’s richer than that found in Eiffel, allowing developers to assert that particular functions are pure, that they complete within given time, or that a recursive algorithm is trending towards completion.

The bit that achieves our aim, of combining the parallelism of Erlang with the safety of Eiffel, is the cpre/3 function, for implementing concurrent preconditions. The server inspects a message, the sender (from), and its internal state, and decides either to handle the message or queue it for later, changing the server state as appropriate.

In this way, preconditions act not as assertions that trigger failure, but as wait conditions that say when an object is ready to handle a message. The Concurrency Made Easy project found this too, when they built SCOOP, which works the other way and adds concurrency to Eiffel.

EDBC and SCOOP are two very different approaches that start from different places and aim for the middle. It’s really interesting to see that there is common ground: that “precondition as wait” arises whether you add concurrency to contracts, or contracts to concurrency.

posted by Graham at 16:00  

Wednesday, March 6, 2019

Mach and Matchmaker: kernel and language support for object-oriented distributed systems

About this paper

Mach and Matchmaker: kernel and language support for object-oriented distributed systems
, Michael B. Jones and Richard F. Rashid, from the proceedings of OOPSLA ’86.

Notes

Yes, 1986 was a long time ago, but the topics of Mach and Matchmaker are still relevant, and I find it interesting to read about its genesis and development. I also find that it helps me put today’s uses – or abandonments – in context.

Mach

Two main families of operating systems under development today are still based on the CMU Mach project. Let’s get discussing the HURD out of the way, first. The HURD is based on GNU Mach, which is itself based on the University of Utah’s Mach 4.0 project. GNU Mach is a microkernel, so almost all of the operating system facilities are provided by user-space processes. An interesting implication is that a regular user can create a sub-HURD, an environment with a whole UNIX-like system running within their user account on the host HURD.

Not many people do that, though. HURD is very interesting to read and use, but didn’t fulfil its goal of becoming a free host for the GNU system that made it easy to support hardware. Linux came along, as a free host for the GNU system that made it worthwhile to support hardware. I enjoy using the HURD, but we’ll leave it here.

…because we need to talk about the other operating system family that uses Mach, macOS/iOS/watchOS/tvOS/whatever the thing that runs the touchbar on a Macbook Pro is called OS. These are based on CMU Mach 2.5, for the most part, which is a monolithic kernel. Broadly speaking, Mach was developed by adding bits to the 4.2 (then 4.3) BSD kernel, until it became possible to remove all of the BSD bits and still have a working BSD-like system. Mach 2.5 represents the end of the “add Mach bits to a BSD kernel” part of the process.

Based on an earlier networked environment called Accent, Mach has an object-oriented facility in which object references are called “ports”, and you send them messages by…um, sending them messages. But sending them messages is really hard, because you have to get all the bits of all the parameters in the right place. What you need is…

Matchmaker

Originally built for Accent, Matchmaker is an Interface Definition Language in which you describe the messages you want a client and server to use, and it generates procedures for sending the messages and receiving the responses in the client, and receiving the messages and sending the responses in the server.

Being built atop Mach, Matchmaker turns those messages into Mach messages sent between Mach ports. What Mach does to get the messages around is transparent, so it might take a message on one computer and deliver it to a server on a different computer, maybe even running a different architecture.

That transparency was a goal of a lot of object-oriented remote procedure call systems in the 1990s, and by and large fell flat. The reason is Peter Deutsch’s Eight Fallacies of Distributed Computing. Basically you usually want to know when your message is going out over a network, because that changes everything from how likely it is to be received, how likely you are to get a response, to how expensive it will be to send the message.

Matchmaker supported C, Common LISP, Ada, and PERQ Pascal; Accent and Mach messages, and a bunch of different computer architectures. Unfortunately it supported them all through specific knowledge of each, and the paper described here acknowledges how difficult that makes it to work on and proposes future work to clean it up. It’s not clear that future work was ever done; modern Machs all use MIG, an “interim subset” of Matchmaker that only supports C.

Object-oriented design

In my book OOP the Easy Way, I explore the idea that objects are supposed to be small, independent computer programs that communicate over the loosely-coupled channel that is message-sending. Mach and Matchmaker together implement this design. Your objects can be in different languages, on different computers, even in different host operating systems (there were Mach IPC implementations for Mach, obviously, but also VAX Unix and non-Mach BSD). As long as they understand the same format for messages, they can speak to each other.

Consider a Cocoa application. It may be written in Swift or Objective-C or Objective-C++ or Python or whatever. It has a reference to a window, where it draws its views. The app sees that window as an Objective-C object, but the app doesn’t have a connection to the framebuffer to draw windows.

The Window Server has that connection. So when you create a window in your Cocoa application, you actually send a message to the window server and get back a port that represents your window. Messages sent to the window are forwarded to the window server, and events that happen in the UI (like the window being closed or resized) are sent as messages to the application.

Because of the way that Mach can transparently forward messages, it’s theoretically possible for an application on one computer to display its UI on another computer’s window server. In fact, that’s more than a theoretical possibility. NeXTSTEP supported exactly that capability, and an application with the NXHost default set could draw to a window server on a different computer, even one with a different CPU architecture.

This idea of loosely-coupled objects keeps coming up, but particular implementations rarely stay around for long. Mach messages still exist on HURD and Apple’s stuff (both using MIG, rather than Matchmaker), but HURD is tiny and Apple recommend against using Mach or MIG directly, favouring other interfaces like XPC or the traditional UNIX IPC systems that are implemented atop Mach. Similarly, PDO has come and gone, as have CORBA and its descendents DSOM and DOE.

Even within the world of “let’s use HTTP for everything”, SOAP gave way to REST, which gave way to the limited thing you get if you do the CRUD bits of REST without doing the DAP bits. What you learn by understanding Mach and its interfaces is that this scheme can be applied everywhere from an internet service down to an operating system component.

posted by Graham at 13:00  

Monday, February 11, 2019

Input-Output Maps are Strongly Biased Towards Simple Outputs

About this paper

Input-Output Maps are Strongly Biased Towards Simple Outputs, Kamaludin Dingle, Chico Q. Camargo and Ard A. Louis, Nature Communications 9, 761 (2018).

Notes

On Saturday I went to my alma mater’s Morning of Theoretical Physics, which was actually on “the Physics of Life” (or Active Matter as theoretical physicists seem to call it). Professor Louis presented this work in relation to RNA folding, but it’s the relevance to neural networks that interested me.

The assertion made in this paper is that if you have a map of a lot of inputs to a (significantly smaller, but still large) collection of outputs, the outputs are not equally likely to occur. Instead, the simpler outputs are preferentially selected.

A quick demonstration of the intuition behind this argument: imagine randomly assembling a fixed number of lego bricks into a shape. Some particular unique shape with weird branches can only be formed by an individual configuration of the bricks. On the other hand, a simpler shape with large degree of symmetry can be formed from different configurations. Therefore the process of randomly selecting a shape will preferentially pick the symmetric shape.

The complexity metric that’s useful here is called Kolmogorov complexity, and roughly speaking it’s a measure of the length of a Universal Turing Machine program needed to describe the shape (or other object). Consider strings. A random string of 40 characters, say a56579418dc7908ce5f0b24b05c78e085cb863dc, may not be representable in any more efficient way than its own characters. But the string aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, which is 40 characters, can be written with the Python program:

'a'*40

which is seven characters long including the newline. Assuming eight bits per character, the random string needs 40*8=320 bits to be represented. The forty as can be found by actually finding the Python program, which is 56 bits. The assertion is that a “find a program that generates character sequences of length 40” algorithm (with some particular assumptions in place) will find the a56579… string with probablity 2^-320, but will find the aaa… string with probability 2^-56, which is much, much more likely.

In fact, this paper shows that the upper and lower bounds on the probability of a map yielding a particular output for random input are both dependent on the Kolmogorov complexity of the output. It happens that due to the halting problem, you can’t calculate Kolmogorov complexity for arbitrary outputs. But you can approximate it, for example using Lempel-Ziv complexity (i.e. the length of the input to a lossless compression algorithm needed to recover the same output).

Where does this meet neural networks? In a preprint of a paper selected for the ICLR 2019, with two of the same authors as this paper. Here, we find that a neural network can be thought of as a map between the inputs and weights to a function that does the inference.

Typically neural network architectures have lots more parameters than there are points in the training set, so how is it that they manage to generalise so well? And why is it that different training techniques, including stochastic gradient descent and genetic algorithms, result in networks with comparable performance?

The authors argue that a generalising function is much less complex than an overfitting function, using the same idea of complexity shown above. And that as the training process for the network is sampling the map of inputs->functions, it is more likely to hit on the simple functions than the complex ones. Therefore the fact that neural networks generalise well is intrinsic to the way they select functions from a wealth of possibilities.

My hope is that this is a step toward a predictive theory of neural network architectures. That by knowing the something of the function we want to approximate, we can set a lower bound on the complexity of a network needed to discover sufficiently generalisable functions. This would be huge for both reducing the training effort needed for networks, and for reducing the evaluation runtime. That, in turn, would make it easier to use pretrained networks on mobile and IoT devices.

posted by Graham at 13:30  

Thursday, January 31, 2019

How UX Practitioners Produce Findings in Usability Testing

The Paper

How UX Practitioners Produce Findings in Usability Testing by Stuart Reeves, in ACM Transactions on Computer-Human Interaction, January 2019.

Notes

Various features of this paper make it a shoe-in for Research Watch.

  • It is about the intersection between academia and commercial practice. That is where the word “Labrary” comes from.
  • It extends the usual “human-computer interaction” focus of UX to include the team performing the UX, which an aspect of PETRI.
  • I get to use the word “praxeology”.

Reeves compares the state of UX in the academic literature with the state of UX in commercial fields. He finds a philosophical gap that is similar to something I observed when studying “Requirements Engineering” on a Software Engineering M.Sc. course. Generally, the academic treatment of UX describes usability problems as things that exist, and that the task of UX activities is to find them.

The same can be seen in much early literature on requirements engineering. We assume that there is a Platonic model of how a software product should work, and that the job of the requirements engineer is to “gather” requirements from the stakeholders. Picture a worker with a butterfly net, trying to collect in these elusive and flighty requirements so they can pin them down in a display case made by the Jira Cabinet Company.

There’s an idea here that, even before it’s formed, the software is real and has an identity independent of the makers, users, and funders. Your role in the software production process is one of learning and discovery, trying to attain or at least approximate this ideal view of the system that’s out there to be had.

Contrasted with this is the “postmodern” view, which is a more emergent view. Systems and processes result from the way that we come together and interact. A software system both mediates particular interactions and blocks or deters others. The software system itself is the interaction between people, and developments in it arise as a result of their exchanges.

In this worldview, there are not “UX problems” to be found by adequate application of UX problem-discovery tools. There are people using software, people observing people using software, and people changing software, and sometimes their activities come together to result in a change to the software.

This philosophy is the lens through which Reeves engages in the praxeology (study of methods) of UX practitioners. His method is informed by ethnomethodological conversation analysis, which is an academic way of saying “I watched people in their context, paying particular attention to what they said to each other”.

The UX activity he describes is performed by actors in two different rooms. In the test room, the participant uses a computer to achieve a goal, with some context and encouragement provided by a moderator. The rest of the team are in the observation room, where they can see and hear the test room and the participant’s screen but talk amongst themselves.

Four representative fragments expose different features of the interactions, and to my mind show that UX is performative, arising from those interactions rather than being an intrinsic property of the software.

  • In fragment A, the participant reports a problem, the observers react and decide to report it.
  • In fragment B, the participant reports a problem, the observers react and suppress reporting it.
  • In fragment C, the participant does not seem to be having a problem, but the observers comment that they did not do something they would have expected, and discuss whether this is an issue.
  • In fragment D, the participant is working on the task but does not choose the expected approach, observers see that, and define a problem and a solution that encompasses that.

One observation here is that even where a participant is able to complete the task, a problem was raised. The case in fragment D is that the participant was asked how they would report a problematic advert. They described sending an email to the client. That would work. However, the product team see that as a problem, because they are working on the “submit a complaint” feature on the website. So, even though the task goal can be satisfied, it was not satisfied the way they want, which means there’s a UX problem.

There are all sorts of things to learn from this. One is that you can’t separate the world neatly into “ways humans do things” and “measurements of the ways humans do things”, because the measurements themselves are done by humans who have ways of doing things. Another is that what you get out of UX investigations depends as much on the observers as it does on the participants’ abilities. What they choose to collectively see as problems and to report as problems depends on their views and their interactions to an extent comparable to their observations of the participants working through the tasks.

Ultimately it’s more evidence for the three systems model. Your team, your software, and your customers are all interacting in subtle ways. Behaviour in any one of these parts can cause significant changes in the others.

posted by Graham at 13:15  

Thursday, January 24, 2019

Structured Pruning of Deep Convolutional Neural Networks

Structured Pruning of Deep Convolutional Neural Networks, Sajid Anwar et al. In the ACM Journal on Emerging Technologies in Computing special issue on hardware and algorithms for learning-on-a-chip, May 2017.

Notes

Quick, a software engineer mentions a “performance” problem to you. What do they mean?

This is, of course, an unfair question. There are too many different ideas that all get branded “performance” for us to know what we are trying to solve. This paper is simultaneously about two different flavours of performance.

On the one hand, the “performance” of a neural network is related to its ability to perform its task: the correctness of its inferences. There isn’t really a good way to know what neural network configuration will perform efficiently for the (unknown) function you want it to approximate. Therefore, the rule of thumb is find a network that’s too complex, and stop training it when it begins to overfit (when its performance starts to degrade, because it’s being too specific about whether an input looks like an example from the training set rather than whether it shares important features).

Now we meet the other kind of performance: the amount of resources consumed to do the work. A large neural network needs to do a lot of computations with a lot of numbers to classify an input, and that means using a lot of processor cycles and a lot of memory. Because our approach to designing the network was to overspecify it, we are using more computer than we need. But if that computer is relatively low-specification and battery operated—a mobile phone for example—this may render our solution unusable.

So, how can we turn a complex neural network into a simpler neural network? While this isn’t totally satisfying, the answer is: “guess”. Turn off bits of the network (i.e. set weights to zero), and see whether it still classifies (performs) well. This act of turning bits of the network off is called pruning.

Ironically some previous work in this space has actually not been great for performance (the resource kind). You can “unroll” convolutional layers (commonly found in image-classifying networks) into matrix multiplications, and you can turn that into a sparse matrix by approximating all small weights with zero and only storing the non-zero values and their locations. But now, even though you have fewer calculations, you may have more memory accesses in trying to solve where the weights should be used. And that could be slower than not reducing the network.

The work in this paper takes a structured approach to pruning the network. Whole feature maps (scores indicating whether particular characteristics of an image were found, and where, in the input image) can be removed, the network retrained, and the performance (ability to classify) measured afterwards. At smaller scales, the kernels can be pruned in particular deterministic ways, replacing a full weights matrix with a start index, a “stride” (gap between each non-zero value) and the list of non-zero weights. The different possibilities are explored using a combination of random generation and evolutionary iteration; networks that have a misclassification rate within given tolerance the original are kept into subsequent generations.

The results seem promising. With pruning at both levels of abstraction, the resulting network is just as deep (it contains as many layers) but it has fewer nodes at each layer and fewer connections between nodes. The systematic pruning approach means that the resulting networks are smaller in memory and faster in use: CPU time measurements are down approximately two thirds when compared with the initial, unpruned network.

However, be careful when interpreting the graph: the authors are showing the reduced execution time of the unrolled matrix multiplication for one layer of one network configuration. It is not clear what this means for overall behaviour of the network, what the misclassification rate of this network was (they show a tolerance cutoff at 4%, which may be too high for a given use case), or in general how the CPU time savings vary with network topology. In other words, we have a single graph, and don’t know how to generalise it.

I hope that at some point a sound theoretical basis for choosing the architecture for a neural network to solve a given problem will be developed. In fact, I sort of hope that it exists now, and that I haven’t found it. I don’t think so: for the moment, whack-a-mole is the state of the art, and this paper shows you can whack quite a lot of moles and still get reasonable results.

posted by Graham at 17:20  

Tuesday, January 8, 2019

The ABC of Software Engineering Research

About this paper

The ABC of Software Engineering Research by Klaas-Jan Stol and Brian Fitzgerald, published October 2018. See link for full citation.

Notes

There are too many ways in which terms describing research methods in software engineering get used, and these authors have a solution. The reason, at least according to the introductory discussion in this paper, is in part a case of discipline envy. This is the idea that we don’t quite know what software engineering is, but we know what those people over there do, and we like that, so we’ll co-opt it.

You could argue that the entire idea of software engineering is discipline envy. A collection of computing experts from academia and industry didn’t quite know how to formalise the problems faced by software makers in the 1960s, but they did know what engineers do, and they liked that. In fact, it’s not clear that they (or at least we, their intellectual descendents) truly understood what engineers do, but nonetheless we gave it a jolly good go. In 1968, in the town of Garmisch in Germany, a discipline was born.

Now, it’s interesting that discipline envy has turned up so early in this discussion, because the contribution in this paper is a framework borrowed from social science researchers. To understand the applicability of cross-discipline seeding, we have to ask how strong the analogy “software engineering is just like X” seems to be (as well as identify how well the proposed idea worked in field X). So, is software engineering like a social science?

Here, the authors carve the field in two. They distinguish “solution-seeking” research, in which we identify what we ought to do about a problem, from “knowledge-seeking” research, in which we identify what people do do about the problem. The bad news about ditching the solution-seeking half of the discipline is that we just lost the engineering from software engineering, the bit where we use scientific results to propose novel solutions to problems.

The good news is that knowledge-seeking software engineering research does look quite a bit like a social science. People, in some context, do things, and we can try to understand that. Indeed that is the origin of the ABC initialism: Actors (the people), Behaviour (the things) and Context.

Well, we can understand bits of it at a time. Like good consultants, the authors introduce a quadrant diagram. On one axis, the “generalisability” of a research method, from highly universal contexts to deeply specific contexts. On the other, the “obtrusiveness” of the method, reflecting whether the researchers are passive observers or active interferers.

As the Labrary stands at the intersection, we approve of the idea that two different things lie on a continuum, rather than being an either/or choice. This framework makes the point that while a particular research technique or strategy is situated somewhere in the general/obtrusive map, others are available elsewhere. The reaction to a highly-controlled lab experiment should not be to declare that the result is not generalisable, but to understand what else could be done to explore generalisations of its results.

The discussions of where particular research strategies fit in the map are interesting, though some of the analogies drawn are fairly tenuous. The authors show where on the map the maximum applicability of a method for each of the key properties lies: universally contextual, unobtrusive research generalises over Actors best, while highly-obtrusive methods allow more precise measures of particular Behaviours and more focus gives a more realistic Context. It would be really beneficial to see a similar framework for “solution-seeking” literature, so we can evaluate the applicability of techniques developed in software engineering research to “practical” problems.

posted by Graham at 17:20  
Next Page »

Powered by WordPress