Structure and Interpretation of Computer Programmers

I make it easier and faster for you to write high-quality software.

Thursday, June 25, 2020

SICPers podcast episode 9

In this episode I talk about Design by Contract. Episode RSS feed – also available in Apple and Google Podcasts.

posted by Graham at 18:31  

Friday, March 27, 2020

On the tyranny of autoincrementing integer primary keys

In designing a relational database schema, many people will automatically create a column id integer primary key for every table, using their database’s automatic increment feature to assign a new value to each row they insert. My assertion is that this choice of primary key should be the last resort, not the first.

A database schema is a design artifact, describing the data we want to store and the relationships between records (rows) in those data. It is also meta-design, because the schema constrains us in designing the queries we use to work with the data. Using the same minimal-effort primary key type for every table therefore fails to communicate anything about the structure and meaning of the data in that table, and imposes irrelevant features on the queries we design.

The fact that people use the name id for this autoincrementing integer field gives away the fact that the primary key is used to identify a row in a database. The primary key for a table should ideally be the minimal subset of relevant information that uniquely identifies an individual record. If you have a single column, say name, with not null and unique constraints, that’s an indicator (though not a cast-iron guarantee) that this column may be the table’s primary key. Sometimes, the primary key can be a tuple of multiple columns. A glyph can be uniquely identified by the tuple (character, font, swash) for example (it can, regardless of whether this is how your particular favourite text system represents it, or whether you think that this is a weird way to store ligatures). The glyphs “(e, Times New Roman Regular 16pt, normal)” and “(ct, Irvin Demibold 24pt, fancy)” are more readily recognisable than the glyphs “146276” and “793651”, even if both are ways to refer to the same data. A music album is identified by the artist and the album name (he says, side-eyeing Led Zeppelin): “A Night at the Opera” is ambiguous while “(Blind Guardian, A Night at the Opera)” is definitely not “(Queen, A Night at the Opera)”.
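A minimal sketch of the tuple-as-primary-key idea, assuming Python's built-in sqlite3 module and a hypothetical album table (the table, its column names, and the add_album helper are illustrative, not taken from any real application):

```python
import sqlite3

# Hypothetical "album" table: the (artist, title) pair is the primary key,
# so there is no separate integer id column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table album ("
    " artist varchar not null,"
    " title varchar not null,"
    " primary key (artist, title))"
)

# Same title, different artists: two distinct rows.
conn.execute("insert into album values ('Queen', 'A Night at the Opera')")
conn.execute("insert into album values ('Blind Guardian', 'A Night at the Opera')")


def add_album(artist, title):
    """Return True if the row was inserted, False if that key already exists."""
    try:
        conn.execute("insert into album values (?, ?)", (artist, title))
        return True
    except sqlite3.IntegrityError:
        return False


# Inserting the same (artist, title) pair again violates the primary key.
duplicate_accepted = add_album("Queen", "A Night at the Opera")
album_count = conn.execute("select count(*) from album").fetchone()[0]
```

The title alone is allowed to repeat; only the full (artist, title) pair has to be unique, which is exactly the identity rule described above.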

Use an integer identifier where there is no other way to uniquely identify rows in a table. Note: sometimes there is another, more meaningful way, even where that just means using somebody else’s unique identifier: different copies of the same book will have unique shelfmarks if they’re part of a library, for example. People in an organisation may have an employee number, or a single sign-on user name; though there may be privacy reasons not to use these.

A side-effect of using useful information to identify rows in a database is that it can simplify your queries, because where your foreign keys would otherwise be meaningless numbers, they now actually carry useful information. Here’s an example, from a real application, in which I’m sad to say I designed both the “before” and “after” schemata.

The app is a risk management tool. There are descriptions of risks (I’d like to believe that they all at least have a distinct description but I can’t be sure, so those will use integer id PKs), and for each risk there are people in certain roles who bring particular skills to bear on mitigating the risk. The same role can be applied to more than one risk, the same skill can be applied by more than one role, and one role may apply multiple skills, so there’s a three-way join to work out, for a given risk, what roles and skills are relevant.

The before schema:

create table risk (id integer primary key, description varchar not null, weight integer, severity integer, likelihood integer); -- many fields elided
create table role (id integer primary key, name varchar not null, unique(name)); -- ruh roh
create table skill (id integer primary key, name varchar not null, unique(name)); -- the same anti-pattern twice
create table risk_role_skill (id integer primary key, risk_id integer, role_id integer, skill_id integer, foreign key(risk_id) references risk(id), foreign key(role_id) references role(id), foreign key(skill_id) references skill(id));

In this application, we start by looking at a list of risks then inspect one to see what roles are relevant to mitigating it, and then what skills. So a valid question is: “given a risk, what roles are relevant to it?”

select distinct role.name from role inner join risk_role_skill on role.id = risk_role_skill.role_id where risk_role_skill.risk_id = ?;

But if we notice that the names of each role and skill are unique, then we can surmise that they are sufficient to identify a given role or skill. In fact, the only information we have about roles or skills is their names.

create table risk (id integer primary key, description varchar not null, weight integer, severity integer, likelihood integer); -- many fields elided
create table role (name varchar primary key); -- uhhh...
create table skill (name varchar primary key); -- this still looks weird...
create table risk_role_skill (id integer primary key, risk_id integer, role_name varchar, skill_name varchar, foreign key(risk_id) references risk(id), foreign key(role_name) references role(name), foreign key(skill_name) references skill(name));

Here’s the new query:

select distinct role_name from risk_role_skill where risk_id = ?;

We’ve removed the join completely!
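To check that the two queries really answer the same question, here is a rough sketch using Python's built-in sqlite3 module. The sample rows are made up, and the risk table is trimmed to the columns the example needs; the joining query is written with an explicit "from role" clause so that it runs.

```python
import sqlite3

ROLES_BEFORE = (
    "select distinct role.name from role "
    "inner join risk_role_skill on role.id = risk_role_skill.role_id "
    "where risk_role_skill.risk_id = ?"
)
ROLES_AFTER = "select distinct role_name from risk_role_skill where risk_id = ?"

# Schema with integer ids everywhere.
before = sqlite3.connect(":memory:")
before.executescript("""
create table risk (id integer primary key, description varchar not null);
create table role (id integer primary key, name varchar not null, unique(name));
create table skill (id integer primary key, name varchar not null, unique(name));
create table risk_role_skill (id integer primary key, risk_id integer, role_id integer, skill_id integer,
    foreign key(risk_id) references risk(id),
    foreign key(role_id) references role(id),
    foreign key(skill_id) references skill(id));
-- Made-up sample data.
insert into risk values (1, 'data loss');
insert into role values (1, 'sysadmin'), (2, 'developer');
insert into skill values (1, 'backups'), (2, 'testing');
insert into risk_role_skill (risk_id, role_id, skill_id) values (1, 1, 1), (1, 2, 1), (1, 2, 2);
""")

# Schema where roles and skills are identified by name.
after = sqlite3.connect(":memory:")
after.executescript("""
create table risk (id integer primary key, description varchar not null);
create table role (name varchar primary key);
create table skill (name varchar primary key);
create table risk_role_skill (id integer primary key, risk_id integer, role_name varchar, skill_name varchar,
    foreign key(risk_id) references risk(id),
    foreign key(role_name) references role(name),
    foreign key(skill_name) references skill(name));
-- The same made-up data, keyed by name.
insert into risk values (1, 'data loss');
insert into role values ('sysadmin'), ('developer');
insert into skill values ('backups'), ('testing');
insert into risk_role_skill (risk_id, role_name, skill_name) values
    (1, 'sysadmin', 'backups'), (1, 'developer', 'backups'), (1, 'developer', 'testing');
""")

roles_before = sorted(row[0] for row in before.execute(ROLES_BEFORE, (1,)))
roles_after = sorted(row[0] for row in after.execute(ROLES_AFTER, (1,)))
```

Both lists come out the same for this data; the second query simply reads the names straight from the association table instead of looking them up through an id.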

Two remaining points:

  1. There’s literally no information carried in the role and skill tables now, other than their identifying names. Does that mean we need the tables at all? In this case no, but in general we need to think here. How are the names in the join table going to get populated otherwise? If there is a limited set of valid values to choose from, then keeping a table with the range of values and a foreign key constraint to that table may be a good way to express the intent that the column’s content be drawn from that range. As an example, a particular bookstore may have printed, ebook, and audiobook media, so could restrict the medium field in their stock table to one of those values.
  2. Why does the risk_role_skill table have an identifier at all? It is a collection of associations between values, so a row’s content is that row’s identity.
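The bookstore case in the first point can be sketched as follows, assuming Python's built-in sqlite3 module. The medium and stock tables, their columns, and the add_stock helper are hypothetical names for illustration. Note that SQLite only enforces foreign keys once "pragma foreign_keys" is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("pragma foreign_keys = on")
conn.executescript("""
-- The lookup table carries nothing but the permitted values.
create table medium (name varchar primary key);
insert into medium values ('printed'), ('ebook'), ('audiobook');

-- Hypothetical stock table: medium must be one of the values above.
create table stock (title varchar not null, medium varchar not null,
    foreign key(medium) references medium(name),
    primary key(title, medium));
""")


def add_stock(title, medium):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("insert into stock values (?, ?)", (title, medium))
        return True
    except sqlite3.IntegrityError:
        return False


ebook_accepted = add_stock("Zen and the Art of Motorcycle Maintenance", "ebook")
vinyl_accepted = add_stock("Zen and the Art of Motorcycle Maintenance", "vinyl")
```

Here the one-column table earns its keep: it is the place where the range of valid media is written down, and the database refuses anything outside it.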

Here’s the after schema:

create table risk (id integer primary key, description varchar not null, weight integer, severity integer, likelihood integer); -- many fields elided
create table risk_role_skill (risk_id integer, role varchar, skill varchar, foreign key(risk_id) references risk(id), primary key(risk_id, role, skill));

And the after query:

select distinct role from risk_role_skill where risk_id = ?;

Two fewer tables, no joins, altogether a much simpler database to understand and work with.
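As a hedged sketch of working with this final shape, again assuming Python's built-in sqlite3 module, made-up sample rows, and a risk table trimmed to two columns; roles_for and skills_for are hypothetical helper names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table risk (id integer primary key, description varchar not null);
create table risk_role_skill (risk_id integer, role varchar, skill varchar,
    foreign key(risk_id) references risk(id),
    primary key(risk_id, role, skill));
-- Made-up sample data: the developer role applies two skills to risk 1.
insert into risk values (1, 'data loss');
insert into risk_role_skill values
    (1, 'sysadmin', 'backups'), (1, 'developer', 'backups'), (1, 'developer', 'testing');
""")


def roles_for(risk_id):
    """Roles relevant to a risk; distinct collapses roles that apply several skills."""
    rows = conn.execute(
        "select distinct role from risk_role_skill where risk_id = ?", (risk_id,)
    )
    return sorted(row[0] for row in rows)


def skills_for(risk_id, role):
    """Skills a given role brings to a given risk."""
    rows = conn.execute(
        "select skill from risk_role_skill where risk_id = ? and role = ?",
        (risk_id, role),
    )
    return sorted(row[0] for row in rows)


roles = roles_for(1)
developer_skills = skills_for(1, "developer")
```

The distinct is still needed in the first query because a role appears once per skill it applies. The follow-up question ("which skills does this role bring?") needs no distinct, since the primary key already guarantees each (risk_id, role, skill) row is unique.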

posted by Graham at 18:31  

Thursday, October 24, 2019

Zen and the Art of Software Maintenance

In one part of the book Zen and the Art of Motorcycle Maintenance, which is neither about Zen nor motorcycle maintenance, there are two motorcycles and two riders. John Sutherland is a romanticist who appreciates the external qualities of his motorcycle: its aesthetics, and its use as a vehicle. The narrator is a classicist who appreciates the internal qualities of his motorcycle: its workings, parts, and mechanisms. When Sutherland has a problem with his bike he takes it to a mechanic. When the narrator does, he rationalises about the problem and attempts to discover a solution.

The book, which as its subtitle gives away is “an inquiry into values”, then follows the narrator’s exploration of a third way of considering quality that marries the romantic and classical notions holistically.

Now we come onto software. Software doesn’t exist. At some level, its abstractions and mathematics get translated into a sequence of states of an electronic machine that turns logic into procedure: but even that is a description that’s a few degrees abstracted from what software and computers really do.

Nonetheless, software has external and internal qualities. It has aesthetics and utility, and can be assessed romantically. A decidedly pedestrian word to describe the romanticist view of software is “requirements”, but it’s a common word in software engineering that means the right thing.

Software also has workings, parts, and mechanics. Words from software engineering to describe the classical view of software include architecture, design, clean code, SOLID…

…there are many more of these words! Unsurprisingly, the people who build software and who change software tend to take a classical view of the software, and have a lot more words to describe its internal qualities than its external qualities.

Typically, the people who are paying for software are interested in the romantic view. They want it to work to achieve some goal, and want someone else (us!) to care about what makes it work. Perhaps that’s why so many software teams phrase their requirements as “As a romantic, I want to task so that I can goal.”

Which is to say that making software professionally involves subordinating classical interpretations of quality to romantic interpretations. Which is not to say that a purely-classical viewpoint is without value. It’s just a different thing from teaching a computer somersaults for a paying audience.

And maybe that subordination of our classical view to the customer/gold owner’s romantic view is the source of the principles:

Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

and:

Working software is the primary measure of progress.

In fact, this second one is not quite true. It suggests that you could somehow “count software”, and the more (working) software you’ve delivered, the better you’re doing. Romanticism shows us that people only want software in that it enables some process or business opportunity, or makes it more efficient, or reduces errors, or lets them enjoy some downtime, or helps them achieve some other goal. So really progress toward that goal is the primary measure of progress, and working software is a leading metric that we hope tells us how we’re working toward that goal.

So all of those code quality and software architecture things are in support of the external view of the software, which is itself in support of some other, probably non-software-related, goal. And that’s why the cleanliness, or architectural niceness, or whatever classical quality, of the code is not absolute, but depends on how those qualities support the romantic qualities of the code.

Real life comes at you fast, though. When you’re working on version 1, you want to do as little work, as quickly as possible, to get to the point where you can validate that there are enough customers who derive enough value to make the product worthwhile. But by the time you come to work on version 1.0.1, you wish you’d taken the time to make version 1 maintainable and easy to change. Most subsequent versions are a little from column A and a little from column B, as you try new things and iterate on the things that worked.

As fast as possible, but no faster, I guess.

posted by Graham at 23:51  

Wednesday, January 2, 2019

The App that Wasn’t (Yet)

One of the early goals written into the mission statement of the Labrary was an eponymous app for organising research notes. I’ve used Mekentosj Springer Readcube Papers for years, and encountered Mendeley and others, and found that they were all more focussed on the minutiae of reference management, rather than the activity of studying and learning from the material you’re collecting in your library. Clearly those are successful apps that have an audience, but is there space for something more lightweight?

I talked to a few people, and the answer was yes. There were people in software engineering, data science, and physics who identified as “light” consumers of academic literature, people who read the primary literature to learn from and find techniques to apply, but do not need or even want the full cognitive weight of bibliographic reference management. They (well, “we”, I wanted it too) wanted to make notes while they were reading papers, and find those notes again. We wanted to keep tags on interesting references to follow up. We wanted to identify the questions we had, and whether they were answered. And we wanted to have enough information—but not more—to help us find the original article again.

My first prototype was as simple as I could make it. There’s a picture below: it’s a ring binder, with topic dividers, and paper notes (at least one separate sheet for each article) which quickly converged on a pro forma layout as shown.

An early prototype of the Labrary app.

I liked it; in fact, I quickly got to a point where I wouldn’t read an article unless I had access to a pad and pen to add a page to my binder. People I showed it to liked it, too. So this seemed like a good time to crack open the software making tools!

The first software prototype was put together in spare time using GNUstep and Renaissance, and evinced two problems:

  • The UI design led back down the route of “bibliopedantry”, forcing students to put more effort into getting the citation details correct than they wanted to.
  • Renaissance lacked support for some Cocoa controls it would have been helpful to use, so there was a choice to be made to invest more into improving Renaissance or finding a different UI layout tool.

A screenshot of the ill-fated “Library” window in Labrary’s GNUstep prototype.

This experience made me look for other inspiration for ways to organise the user interface so that students get the experience of taking notes, not of fiddling with citation data. I considered writing Labrary as a plugin for the free Calibre e-reader app, so that Labrary could focus on being about study notes and Calibre could focus on being about library management. But ultimately I found the tool that solved the problem best: Apple’s Finder.

The Labrary pro forma note as Finder stationery.

I’ve recreated the pro forma note from the binder as a text file, and set the “Stationery Pad” flag in the Finder. When I open this file, Finder creates a duplicate and opens that instead, in my editor of choice: ready to become a new study note! I put this in a folder with a Zim index file, so I can get the “shoebox” view of all the notes by opening the folder in Zim. It also does full-content searching, so the goal of finding a student’s notes again is achieved.

Zim open on my research notes folder.

I’m glad I created the lo-fi paper prototype. It let me understand what I was trying to achieve, and show very quickly that my software implementation was going in the wrong direction. And I’m always happy to be the person to say “do we need to write this, or can it be built out of other bits?”, as I explored for this project with Zim and Calibre.

posted by Graham at 11:23  
