Mach and Matchmaker: kernel and language support for object-oriented distributed systems

About this paper

Mach and Matchmaker: kernel and language support for object-oriented distributed systems
, Michael B. Jones and Richard F. Rashid, from the proceedings of OOPSLA ’86.

Notes

Yes, 1986 was a long time ago, but the topics of Mach and Matchmaker are still relevant, and I find it interesting to read about their genesis and development. I also find that it helps me put today's uses – or abandonments – in context.

Mach

Two main families of operating systems under development today are still based on the CMU Mach project. Let’s get discussing the HURD out of the way, first. The HURD is based on GNU Mach, which is itself based on the University of Utah’s Mach 4.0 project. GNU Mach is a microkernel, so almost all of the operating system facilities are provided by user-space processes. An interesting implication is that a regular user can create a sub-HURD, an environment with a whole UNIX-like system running within their user account on the host HURD.

Not many people do that, though. HURD is very interesting to read and use, but didn’t fulfil its goal of becoming a free host for the GNU system that made it easy to support hardware. Linux came along, as a free host for the GNU system that made it worthwhile to support hardware. I enjoy using the HURD, but we’ll leave it here.

…because we need to talk about the other operating system family that uses Mach: macOS/iOS/watchOS/tvOS/whatever the thing that runs the Touch Bar on a MacBook Pro is called OS. These are based on CMU Mach 2.5, for the most part, which is a monolithic kernel. Broadly speaking, Mach was developed by adding bits to the 4.2 (then 4.3) BSD kernel, until it became possible to remove all of the BSD bits and still have a working BSD-like system. Mach 2.5 represents the end of the “add Mach bits to a BSD kernel” part of the process.

Based on an earlier networked environment called Accent, Mach has an object-oriented facility in which object references are called “ports”, and you send them messages by…um, sending them messages. But sending them messages is really hard, because you have to get all the bits of all the parameters in the right place. What you need is…

Matchmaker

Originally built for Accent, Matchmaker is an Interface Definition Language in which you describe the messages you want a client and server to use, and it generates procedures for sending the messages and receiving the responses in the client, and receiving the messages and sending the responses in the server.
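Matchmaker's own syntax isn't something I'll try to reproduce from memory, but the generated code follows the familiar stub pattern. Here is a minimal, purely illustrative Python sketch (the names, the message layout and the toy Port class are all invented for this example; the real tool emitted C, Lisp, Ada or Pascal and spoke real Accent or Mach messages):

import struct

MSG_GET_TIME = 1  # hypothetical message identifier

class Port:
    """Stand-in for a Mach port: just a FIFO of byte strings."""
    def __init__(self):
        self._queue = []

    def send(self, payload):
        self._queue.append(payload)

    def receive(self):
        return self._queue.pop(0)

# What a generated client procedure does: marshal the arguments,
# send the request, then unmarshal the reply.
def clock_get_time(request_port, reply_port, clock_id):
    request_port.send(struct.pack("!II", MSG_GET_TIME, clock_id))
    (seconds,) = struct.unpack("!Q", reply_port.receive())
    return seconds

# What the generated server procedure does: unmarshal the request,
# dispatch to the implementation, marshal the reply.
def serve_one_message(request_port, reply_port, implementation):
    msg_id, clock_id = struct.unpack("!II", request_port.receive())
    if msg_id == MSG_GET_TIME:
        reply_port.send(struct.pack("!Q", implementation.get_time(clock_id)))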

Being built atop Mach, Matchmaker turns those messages into Mach messages sent between Mach ports. What Mach does to get the messages around is transparent, so it might take a message on one computer and deliver it to a server on a different computer, maybe even running a different architecture.

That transparency was a goal of a lot of object-oriented remote procedure call systems in the 1990s, and it by and large fell flat. The reason is Peter Deutsch's Eight Fallacies of Distributed Computing. Basically, you usually want to know when your message is going out over a network, because that changes everything: how likely it is to be received, how likely you are to get a response, and how expensive it will be to send the message.

Matchmaker supported C, Common LISP, Ada, and PERQ Pascal; both Accent and Mach messages; and a bunch of different computer architectures. Unfortunately it supported them all through specific knowledge of each, and the paper described here acknowledges how difficult that makes it to work on and proposes future work to clean it up. It's not clear that future work was ever done; modern Machs all use MIG, an "interim subset" of Matchmaker that only supports C.

Object-oriented design

In my book OOP the Easy Way, I explore the idea that objects are supposed to be small, independent computer programs that communicate over the loosely-coupled channel that is message-sending. Mach and Matchmaker together implement this design. Your objects can be in different languages, on different computers, even in different host operating systems (there were Mach IPC implementations for Mach, obviously, but also VAX Unix and non-Mach BSD). As long as they understand the same format for messages, they can speak to each other.

Consider a Cocoa application. It may be written in Swift or Objective-C or Objective-C++ or Python or whatever. It has a reference to a window, where it draws its views. The app sees that window as an Objective-C object, but the app doesn’t have a connection to the framebuffer to draw windows.

The Window Server has that connection. So when you create a window in your Cocoa application, you actually send a message to the window server and get back a port that represents your window. Messages sent to the window are forwarded to the window server, and events that happen in the UI (like the window being closed or resized) are sent as messages to the application.

Because of the way that Mach can transparently forward messages, it’s theoretically possible for an application on one computer to display its UI on another computer’s window server. In fact, that’s more than a theoretical possibility. NeXTSTEP supported exactly that capability, and an application with the NXHost default set could draw to a window server on a different computer, even one with a different CPU architecture.

This idea of loosely-coupled objects keeps coming up, but particular implementations rarely stay around for long. Mach messages still exist on HURD and Apple's stuff (both using MIG, rather than Matchmaker), but HURD is tiny and Apple recommend against using Mach or MIG directly, favouring other interfaces like XPC or the traditional UNIX IPC systems that are implemented atop Mach. Similarly, PDO has come and gone, as have CORBA and its descendants DSOM and DOE.

Even within the world of “let’s use HTTP for everything”, SOAP gave way to REST, which gave way to the limited thing you get if you do the CRUD bits of REST without doing the DAP bits. What you learn by understanding Mach and its interfaces is that this scheme can be applied everywhere from an internet service down to an operating system component.

The balloon goes up

To this day, many Smalltalk projects have a hot air balloon in their logo. These reference the cover of the issue of Byte Magazine in which Smalltalk-80 was shared with the wider programming community.

A hot air balloon bearing the word "Smalltalk" sails over a castle on a small island.

Modern Smalltalks all have a lot in common with Smalltalk-80. Why? If you compare Smalltalk-72 with Smalltalk-80 there’s a huge amount of evolution. So why does Cincom Smalltalk or Amber Smalltalk or Squeak or even Pharo still look quite a lot like Smalltalk-80?

My answer is: because they are used. That's Alan Kay's answer, too:

Basically what happened is this vehicle became more and more a programmer’s vehicle and less and less a children’s vehicle—the version that got put out, Smalltalk ’80, I don’t think it was ever programmed by a child. I don’t think it could have been programmed by a child because it had lost some of its amenities, even as it gained pragmatic power.

So the death of Smalltalk in a way came as soon as it got recognized by real programmers as being something useful; they made it into more of their own image, and it started losing its nice end-user features.

I think there are two different things you want from a programming language (well, programming environment, but let's not split tree trunks). Referencing the ivory tower on the Byte cover, let's call these two schools "academic" and "industrial".

The industrial ones are out there, being used to solve problems. They need to be stable (some of these problems haven't changed much in decades), easy to understand (the people have changed), and they don't need to be exciting; they just need to work. Cobol and Fortran are great in this regard, as are C and, to some extent, C++: you take code written a bajillion years ago, build it, and it works.

The academic ones are where the new ideas get tried out. They should enable experiment and excitement first, and maybe be easy to understand (but if you need to be an expert in the idea you're trying out, that's not so bad).

So the industrial and academic languages have conflicting goals. There's going to be a bad feedback loop if we try to achieve both goals in one place:

  • the people who have used the language as a tool to solve problems won’t appreciate it if new ideas come along that mean they have to work to get their solution building or running correctly, again.
  • the people who have used the language as a tool to explore new ideas won’t appreciate it if backwards compatibility hamstrings the ability to extend in new directions.

Unfortunately, at the moment a lot of languages are used for both, which leads to them being mediocre at both. The new "we've done C but betterer" languages like Go, Rust etc. attract both people who want to add new features and people who want existing stuff not to break. JavaScript is a mess of transpilation, shims, polyfills, and other words that mean "try to use a language, bearing in mind that nobody agrees how it's supposed to work".

Here are some patterns for managing the distinction that have been tried in the past:

  • metaprogramming. Lisp in particular is great at having a little language that you can use to solve your problems, and that you can also use to make new languages or make the world work differently to see how that would work. Of course, if you can change the world then you can break the world, and Lisp isn’t great at making it clear that there’s a line between following the rules and writing new ones.
  • pragmas. Haskell in particular is great at having a core language that people understand and use to write software, and a zillion flags that enable different features that one person pursued in their PhD that one time. Not all of the flag combinations may be that great, and it might be hard to know which things work well and which worked well enough to get a dissertation out of. But these are basically the “enable academic mode” settings, anyway.
  • versions. Perl and Python both ran for years in which version x was the safe, stable, industrial language, and version y (it's not x+1: Python's parallel versions were 2 and 3000) was the one in which people could explore extensions, removals, or other changes in potentially breaking ways. At some point, each project was happy with the choices, declared the new version "ready", and made it available for industrial use. This involved some translation from version x, which wasn't necessarily straightforward (though in the case of Python the difficulty was commonly overblown, so people avoided going from 2 to 3 even when it was easy). People being what they are, they put a lot of store in version numbers, so some people didn't like that folks were recommending x when there was this clearly newer y available.
  • FFIs. You can call industrial C89 code (which still works after three decades) from pretty much any academic language you care to invent. If you build a JVM language, it can do what it wants, and still call Java code. (A sketch of the C case follows this list.)
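To make that concrete, here's a minimal sketch using Python's ctypes FFI to call the venerable C maths library. The only assumption is that the platform can locate a library named "m" in the conventional way:

import ctypes
import ctypes.util

# Ask the platform where its C maths library lives, then load it.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments and results correctly.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0, computed by decades-old C code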

Anyway, I wonder whether that distinction between academic and industrial might be a good one to strengthen. If you make a new programming language project and try to get “users” too soon, you could lose the ability to take the language where you want it to go. And based on the experience of Smalltalk, too soon might be within the first decade.

Image

I love my Testsphere deck, from Ministry of Testing. I’ve twice seen Riskstorming in action, and the first time that I took part I bought a deck of these cards as soon as I got back to my desk.

I’m not really a tester, though I have really been a tester in the past. I still fall into the trap of thinking that I set out to make this thing do a thing, I have made it do a thing, therefore I am done. I’m painfully aware when metacognating that I am definitely not done at that point, but back “in the zone” I get carried away by success.

One of the reasons I got interested in Design by Contract was the false sense of "done" I feel when TDDing. I thought of a test that this thing works. I made it pass the test. Therefore this thing works? Well, no: how can I keep the same workflow and speed of progress, but improve my confidence in that statement?
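One answer, sketched below in Python with a hypothetical ensure decorator (nothing standard; just an illustration of the idea), is to state the property as a postcondition that gets checked on every call, not only on the example inputs I happened to write tests for:

from functools import wraps

def ensure(postcondition):
    """Wrap a function so that its result must satisfy a postcondition."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            assert postcondition(result), f"{fn.__name__} broke its contract"
            return result
        return wrapper
    return decorator

@ensure(lambda total: total >= 0)
def basket_total(prices):
    # The contract is checked for every caller, in every workflow,
    # not only for the cases enumerated in a test suite.
    return sum(prices)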

The Testsphere cards are like a collection of mnemonics for testers, and for people who otherwise find themselves wondering whether this software really works. Sometimes I cut the deck, look at the card I’ve found, and think about what it means for my software. It might make me think about new ways to test the code. It might make me think about criticising the design. It might make me question the whole approach I’m taking. This is all good: I need these cues.

I just cut the deck and found the “Image” card, which is in the Heuristics section of the deck. It says that it’s a consistency heuristic:

Is your product true to the image and reputation you or your app’s company wishes to project?

That’s really interesting. How would I test for that? OK, I need to know what success is, which means I need to know “the image and reputation [we wish] to project”. That sounds very much like a marketing thing. Back when I ran the mobile track at QCon London, Jaimee Newberry gave a great talk about finding the voice for your product. She suggested identifying a celebrity whose personality embodies the values you want to project, then thinking about your interactions with your customers as if that personality were speaking to them.

It also sounds like there’s a significant user or customer experience part to this definition. Maybe marketing can tell me what voice, tone, or image we want to suggest to our customers, but what does it mean to say that a touchscreen interface works like Lady Gaga? Is that swipe gesture the correct amount of quirky, unexpected, subversive, yet still accessible? Do the features we have built shout “Poker Face”?

We’ll be looking at user interface design, too. Graphic design. Sound design. Copyediting. The frequency of posts on the email list, and the level of engagement needed. Pricing, too: it’s no good the brochure projecting Fortnum & Mason if the menu says Five Guys.

This doesn’t seem like something I’m going to get from red to green in a few minutes in Emacs. And it’s one of a hundred cards.

Why 80?

80 characters per line is a standard worth sticking to, even today. OK, why?

Well, back up. Let's examine the axioms. Is 80 characters per line a standard? Not really, it's a convention. IBM cards (which weren't just made by IBM or read by IBM machines) were certainly 80 characters wide, as were DEC video terminals, which Macs etc. emulate. Actually, that's not even true. The DEC VT-05 could display 72 characters per line; the later VT-50 and its successor models introduced 80 characters. The VT-100 could display 132 characters per line, the same quantity as a line printer (including the ones made by IBM). Other video terminals had 40 or 64 character lines. Teletypewriters typically had shorter lines, like 70 characters.

Typewriters were typically limited to \((\mathrm{width\ of\ page} - 2 \times \mathrm{margin\ width}) \times \mathrm{character\ density}\) characters per line. With wide margins and narrow paper you might get 50 characters; with narrow margins and wide paper, maybe 100.
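For example, assuming US Letter paper 8.5 inches wide, one-inch margins on each side, and a pica typeface at ten characters per inch:

\[(8.5 - 2 \times 1)\ \mathrm{inches} \times 10\ \mathrm{characters\ per\ inch} = 65\ \mathrm{characters\ per\ line}.\]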

IBM were not the only people to make cards, punches, and readers. Other manufacturers did, with other numbers of characters per card. IBM themselves made 40, 45 and 96 column cards. Remington Rand made cards with 45 or 90 columns.

So, axiom one modified, “80 characters per line is a particular convention out of many worth sticking to, even today.” Is it worth sticking to?

Hints are that it isn’t. The effects of line length on reading online news explored screen-reading with different line lengths: 35, 55, 75 and 95 cpl. They found, from the abstract:

Results showed that passages formatted with 95 cpl resulted in faster reading speed. No effects of line length were found for comprehension or satisfaction, however, users indicated a strong preference for either the short or long line lengths.

However that isn’t a clear slam dunk. Quoting their reference to prior work:

Research investigating line length for online text has been inconclusive. Several studies found that longer line lengths (80 – 100 cpl) were read faster than short line lengths (Duchnicky and Kolers, 1983; Dyson and Kipping, 1998). Contrary to these findings, other research suggests the use of shorter line lengths. Dyson and Haselgrove (2001) found that 55 characters per line were read faster than either 100 cpl or 25 cpl conditions. Similarly, a line length of 45-60 characters was recommended by Grabinger and Osman-Jouchoux (1996) based on user preferences. Bernard, Fernandez, Hull, and Chaparro (2003) found that adults preferred medium line length (76 cpl) and children preferred shorter line lengths (45 cpl) when compared to 132 characters per line.

So, long lines are read faster than short lines, except when they aren't. They also found that preferences were polarised: the shortest and longest lines were both the most preferred and the least preferred options, depending on the reader.

But is 95cpl a magic number? What about 105cpl, or 115cpl? What about 273cpl, which is what I get if I leave my Terminal font settings alone and maximise the window in my larger monitor? Does it even make sense for programmers who don’t have to line up the comment markers in Fortran-77 code to be using monospaced fonts, or would we be better off with proportional fonts?

And that article was about online news articles, a particular and terse form of prose, being read by Americans. Does it generalise to code? How about the observation that children and adults prefer different lengths, what causes that change? Does this apply to people from other countries? Well, who knows?

Buse and Weimer found that “average line length” was “strongly negatively correlated” with perceived readability. So maybe we should be aiming for one-character lines! Or we can offset the occasional 1,000 character line by having lots and lots of one-character lines:

}
}
}
}
}
}

It sounds like there’s information missing from their analysis. What was the actual shape of the data? What were the maximum and minimum line lengths considered, what distribution of line lengths was there?

We’re in a good place to rewrite the title from the beginning of the post: 80 characters per line is a particular convention out of many that we know literally nothing about the benefit or cost of, even today. Maybe our developer environments need a bit of that UX thing we keep imposing on everybody else.

Ultimate Programmer Super Stack Reloaded

Remember remember the (cough) 6th of November, when APPropriate Behaviour joined a wealth of other learning material for software engineers in a super-discounted bundle called the Ultimate Programmer Super Stack?

It's happening again! This is a five-day flash sale, with all the same material on levelling up as a programmer, running a startup, and learning new technologies like Aurelia, Node, Python and more. The link at the top of this paragraph goes to the sales page, and you've got until Monday, when it's gone for good.

The Fragile Manifesto

A lot of what I’ve been reading and thinking about of late is about the agile backlash. More speed, lower velocity reflects on IT teams pursuing “deliver more/newer IT” at the cost of “help the company achieve its mission”. Grooming the Backfog is about one dysfunction that arises as a result: (mis)managing a never-ending road of small changes rather than looking at the big picture and finding a path toward the destination. Our products are not our products attempts to address this problem by recasting teams not as makers of product, but as solvers of problems.

Here's the latest: UK wasting £37 billion a year on failed agile IT projects. Some people will say that this is a result of not Agiling enough: if you were all Lean and MVP and whatever, you'd not get to waste all of that money. I don't necessarily agree with that: I think there are actually things to learn by, y'know, reading the article.

The truth is that, despite the hype, Agile development doesn’t always work in practice.

True enough, but not a helpful statement, because “Agile” now means a lot of different things to different people. If we take it to mean the values, principles and practices written by the people who came up with the term, then I can readily believe that it wouldn’t work in practice for people whose context is different from those who came up with the ideas in 2001. Which may well be everyone.

I'm also very confident that it doesn't mean that. I met a team recently who said they did "Agile", and discussed their standups and two-week iterations. They also described how they were considering whether to go from an annual to a biannual release.

Almost three quarters (73%) of CIOs think Agile IT has now become an industry in its own right while half (50%) say they now think of Agile as “an IT fad”.

The Agile-Industrial Complex is well-documented. You know what isn’t well-documented? Your software.

The report revealed 44% of Agile IT projects that fail, do so because of a failure to produce enough (or any) documentation.

The survey found that 34% of failed Agile projects failed because of a lack of upfront and ongoing planning. Planning is a casualty of today’s interpretation of the Agile Manifesto[…]

68% of CIOs agree that agile teams require more Architects. From defining strategy, to championing technical requirements (such as performance and security) to ensuring development teams stick to the rules of the game, the role of the Architect is sorely missed in the agile space. It must be reintroduced.

A bit near the top of the front page of the manifesto for agile software development is a sentence fragment that says:

Working software over comprehensive documentation

Before we discuss that fragment, I’d just like to quote the end of the sentence. It’s a long way further down the page, so it’s possible that some readers have missed it.

That is, while there is value in the items on the right, we value the items on the left more.

Refactor -> Inline Reference:

That is, while there is value in comprehensive documentation, we value working software more.

Refactor -> Extract Statement:

There is value in comprehensive documentation.

Now I want to apply the same set of transforms to another of the sentence fragments:

There is value in following a plan.

Nobody ever said don’t have a plan. You should have a plan. You should be willing to amend the plan. I was recently asked what I’d do if I found that my understanding of the “requirements” of a system differ from the customer’s understanding. It depends a lot on context but if there truly is a “the customer” and they want something that I’m not expecting to offer them, it’s time for me to either throw away my version or find a different customer.

Similarly, nobody said don’t have comprehensive documentation. I have been on a very “by-the-book” Agile team, where a developer team lead gave feedback that they couldn’t work out where a change would go to enable a particular feature. That’s architecture! What they wanted was an architectural plan of the system. Except that they couldn’t explicitly want that, because software architecture is so, ugh, 1990s and Rational Rose. Wanting an architecture diagram is like wanting to use CORBA, urrr.

Once you get past that bizarre emotional response, give me a call.

Input-Output Maps are Strongly Biased Towards Simple Outputs

About this paper

Input-Output Maps are Strongly Biased Towards Simple Outputs, Kamaludin Dingle, Chico Q. Camargo and Ard A. Louis, Nature Communications 9, 761 (2018).

Notes

On Saturday I went to my alma mater’s Morning of Theoretical Physics, which was actually on “the Physics of Life” (or Active Matter as theoretical physicists seem to call it). Professor Louis presented this work in relation to RNA folding, but it’s the relevance to neural networks that interested me.

The assertion made in this paper is that if you have a map of a lot of inputs to a (significantly smaller, but still large) collection of outputs, the outputs are not equally likely to occur. Instead, the simpler outputs are preferentially selected.

A quick demonstration of the intuition behind this argument: imagine randomly assembling a fixed number of Lego bricks into a shape. Some particular unique shape with weird branches can only be formed by one individual configuration of the bricks. On the other hand, a simpler shape with a large degree of symmetry can be formed from many different configurations. Therefore the process of randomly selecting a shape will preferentially pick the symmetric shape.

The complexity metric that's useful here is called Kolmogorov complexity, and roughly speaking it's the length of the shortest Universal Turing Machine program needed to describe the shape (or other object). Consider strings. A random string of 40 characters, say a56579418dc7908ce5f0b24b05c78e085cb863dc, may not be representable in any more efficient way than its own characters. But the string aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, which is 40 characters, can be written with the Python program:

'a'*40

which is seven characters long including the newline. Assuming eight bits per character, the random string needs 40*8=320 bits to be represented. The forty a's can instead be represented by the Python program, which is 7*8=56 bits. The assertion is that a "find a program that generates character sequences of length 40" algorithm (with some particular assumptions in place) will find the a56579… string with probability 2^-320, but will find the aaa… string with probability 2^-56, which is much, much more likely.

In fact, this paper shows that the upper and lower bounds on the probability of a map yielding a particular output for random input both depend on the Kolmogorov complexity of the output. It happens that, due to the halting problem, you can't calculate Kolmogorov complexity for arbitrary outputs. But you can approximate it, for example using Lempel-Ziv complexity (i.e. the length of the compressed representation from which a lossless compression algorithm can recover the output).
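The paper's estimator is more careful than this, but you can get a feel for the idea with any off-the-shelf lossless compressor standing in for the complexity measure. A rough Python sketch, using zlib purely as an approximation:

import zlib

def approx_complexity_bits(data):
    # Crudely approximate complexity as the size, in bits, of the
    # zlib-compressed representation (the compressor's fixed overhead included).
    return len(zlib.compress(data, 9)) * 8

random_ish = b"a56579418dc7908ce5f0b24b05c78e085cb863dc"
repetitive = b"a" * 40

# The random-looking string compresses to roughly its own length or longer;
# the repetitive one shrinks to a small fraction of its raw 320 bits.
print(approx_complexity_bits(random_ish), approx_complexity_bits(repetitive))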

Where does this meet neural networks? In a preprint of a paper accepted for ICLR 2019, with two of the same authors as this paper. There, a neural network is treated as a map from its parameters (the weights) to the function that does the inference.

Typically neural network architectures have lots more parameters than there are points in the training set, so how is it that they manage to generalise so well? And why is it that different training techniques, including stochastic gradient descent and genetic algorithms, result in networks with comparable performance?

The authors argue that a generalising function is much less complex than an overfitting function, using the same idea of complexity shown above. And as the training process for the network is sampling this map from parameters to functions, it is more likely to hit on the simple functions than the complex ones. Therefore the fact that neural networks generalise well is intrinsic to the way they select functions from a wealth of possibilities.

My hope is that this is a step toward a predictive theory of neural network architectures: that by knowing something of the function we want to approximate, we can set a lower bound on the complexity of a network needed to discover sufficiently generalisable functions. This would be huge both for reducing the training effort needed for networks and for reducing the evaluation runtime. That, in turn, would make it easier to use pretrained networks on mobile and IoT devices.

HPC at FOSDEM 2019

This year's FOSDEM featured an HPC, Big Data and Data Science devroom on the Sunday. This post is the first part of my notes on the topics presented there. If you are interested, book some time and let's talk about what it means for you and your high-performance computing team.

OpenHPC Update

Adrian Reber from the OpenHPC project gave a refresher on what OpenHPC is, and a status update. OpenHPC has not been represented at FOSDEM since 2016, when the project was very new.

It's a community-driven project with representation from many vendors and HPC sites. At first blush their output might appear to be "RPM packages" and "documentation", but their mission is actually to discover and share best practices in HPC management. Those packages are all well-tested with each other, and the documentation is tested every release, too. The idea is that if you build the core of your cluster with OpenHPC packages on CentOS-like Linux distributions, on either x86-64 or AArch64, you get to rely on tried and tested work from the whole community.

Reber, who works at Red Hat on their OpenHPC efforts, invited everyone to join the weekly project steering calls in a demonstration of the openness of the project. He discussed future directions, including an upcoming release v1.3.7 that will include packages rebuilt with the ARM HPC compiler for AArch64, and the challenge of deciding when it is right to release v1.4, which will drop SLES12 in favour of SLES15 and RHEL7 in favour of RHEL8.

ReFrame

On the subject of HPC libraries, a common frustration is testing codes with various combinations of compilers, MPI libraries, hardware capabilities and so on. Developers both want to know that their code is correct (i.e. the science outcomes are still valid after a change) and that the performance has not been significantly impacted.

Victor Holanda discussed ReFrame, a tool for HPC regression and performance testing developed at CSCS and used regularly on Piz Daint and their other clusters. Written in Python, it gives test authors a way to express what their tests require (e.g. that they must run on machines with CUDA, compile a particular code with one of three different compilers, load environment modules with one of two different MPIs), run the tests, and inspect the output for certain outcomes.
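From memory of the project's tutorial material (attribute names have moved around between ReFrame releases, so treat this as a sketch rather than a faithful transcription of the talk), a test is roughly a Python class like this:

import reframe as rfm
import reframe.utility.sanity as sn

@rfm.simple_test
class HelloTest(rfm.RegressionTest):
    def __init__(self):
        # Which systems and programming environments this test may run with.
        self.valid_systems = ['*']
        self.valid_prog_environs = ['*']
        # What to build; the compiler comes from the selected environment.
        self.sourcepath = 'hello.c'
        # Correctness: the run only passes if this pattern appears in stdout.
        self.sanity_patterns = sn.assert_found(r'Hello, World!', self.stdout)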

Testers get to run a single command, or point their Jenkins or Travis CIs at a single command, to discover and execute the tests. The ReFrame runtime will compare the environments that the test can use with the ones that are available, and will report on the outcomes in each of those environments.

Inside CSCS, ReFrame is used for a 90 minute nightly production test run, and 10 minute maintenance runs to check for system regressions after configuration changes. They also have a set of diagnostic tests to help understand what’s happened if a node goes bad. Their approach to correctness is very robust; the team do not declare that they support something until it has enough users to know how well it works. They also say that in three years of development they “have never seen a python stacktrace” from ReFrame, as they test ReFrame with ReFrame while they are developing it.

Singularity Containers

Singularity from Sylabs is a container runtime tool that specifically addresses the problems of containerising HPC workloads. Eduardo Arango gave a "what's new in Singularity" update, as FOSDEM 2017 had already featured an introduction-level talk.

What's new is that they've rewritten Singularity in Go. This means they get better integration with the libraries used in Docker, Kubernetes and so on, and could adopt the de facto standard Container Network Interface for software-defined networking when running containers. It also reduces the dependencies needed to get Singularity up and running.

The new version uses a new format for containers, SIF (Singularity Image Format), a read-only SquashFS filesystem along with metadata, all of which can be cryptographically signed using PGP for integrity protection. An upcoming extension will allow a writable overlay to be added to a SIF.

Supporting this, Sylabs have a new container library similar to DockerHub for hosting SIF images for public or private cloud use. They have a key store for those PGP signing keys, and a cloud-based remote image builder for developers who need to build images but can’t do it locally.

Conclusion

This has been part one of my FOSDEM HPC round-up. I’ve focussed on the tools that are out there for automating and simplifying HPC workflows, because it’s an interesting problem and one that presents challenges to many HPC teams. Don’t forget that the Labrary can help!

How UX Practitioners Produce Findings in Usability Testing

The Paper

How UX Practitioners Produce Findings in Usability Testing by Stuart Reeves, in ACM Transactions on Computer-Human Interaction, January 2019.

Notes

Various features of this paper make it a shoo-in for Research Watch.

  • It is about the intersection between academia and commercial practice. That is where the word “Labrary” comes from.
  • It extends the usual "human-computer interaction" focus of UX to include the team performing the UX, which is an aspect of PETRI.
  • I get to use the word “praxeology”.

Reeves compares the state of UX in the academic literature with the state of UX in commercial fields. He finds a philosophical gap that is similar to something I observed when studying "Requirements Engineering" on a Software Engineering M.Sc. course. Generally, the academic treatment of UX describes usability problems as things that exist, and the task of UX activities as finding them.

The same can be seen in much early literature on requirements engineering. We assume that there is a Platonic model of how a software product should work, and that the job of the requirements engineer is to “gather” requirements from the stakeholders. Picture a worker with a butterfly net, trying to collect in these elusive and flighty requirements so they can pin them down in a display case made by the Jira Cabinet Company.

There’s an idea here that, even before it’s formed, the software is real and has an identity independent of the makers, users, and funders. Your role in the software production process is one of learning and discovery, trying to attain or at least approximate this ideal view of the system that’s out there to be had.

Contrasted with this is the “postmodern” view, which is a more emergent view. Systems and processes result from the way that we come together and interact. A software system both mediates particular interactions and blocks or deters others. The software system itself is the interaction between people, and developments in it arise as a result of their exchanges.

In this worldview, there are not “UX problems” to be found by adequate application of UX problem-discovery tools. There are people using software, people observing people using software, and people changing software, and sometimes their activities come together to result in a change to the software.

This philosophy is the lens through which Reeves engages in the praxeology (study of methods) of UX practitioners. His method is informed by ethnomethodological conversation analysis, which is an academic way of saying “I watched people in their context, paying particular attention to what they said to each other”.

The UX activity he describes is performed by actors in two different rooms. In the test room, the participant uses a computer to achieve a goal, with some context and encouragement provided by a moderator. The rest of the team are in the observation room, where they can see and hear the test room and the participant’s screen but talk amongst themselves.

Four representative fragments expose different features of the interactions, and to my mind show that UX is performative, arising from those interactions rather than being an intrinsic property of the software.

  • In fragment A, the participant reports a problem, the observers react and decide to report it.
  • In fragment B, the participant reports a problem, the observers react and suppress reporting it.
  • In fragment C, the participant does not seem to be having a problem, but the observers comment that they did not do something they would have expected, and discuss whether this is an issue.
  • In fragment D, the participant is working on the task but does not choose the expected approach; the observers see that, and define a problem (and a solution) that encompasses it.

One observation here is that even where a participant is able to complete the task, a problem can still be raised. The case in fragment D is that the participant was asked how they would report a problematic advert. They described sending an email to the client. That would work. However, the product team see that as a problem, because they are working on the "submit a complaint" feature on the website. So, even though the task goal could be satisfied, it was not satisfied the way they wanted, which means there's a UX problem.

There are all sorts of things to learn from this. One is that you can’t separate the world neatly into “ways humans do things” and “measurements of the ways humans do things”, because the measurements themselves are done by humans who have ways of doing things. Another is that what you get out of UX investigations depends as much on the observers as it does on the participants’ abilities. What they choose to collectively see as problems and to report as problems depends on their views and their interactions to an extent comparable to their observations of the participants working through the tasks.

Ultimately it’s more evidence for the three systems model. Your team, your software, and your customers are all interacting in subtle ways. Behaviour in any one of these parts can cause significant changes in the others.