Opinionated Read: How AI Impacts Skill Formation


The abstract to this preprint, by two authors both associated with Anthropic, makes the claim “We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.”

The first thing to appreciate is that this idea of “safety-critical domains” does a lot of heavy lifting when it comes to software professionalism—on the one hand, engineers say that while (intervention that is perhaps validated in controlled trials or in experience reports but not the way that engineers like to work) is clearly something that those safety-critical folks should concern themselves with, it’s not relevant to (domain that includes the engineers’ work). On the other hand, professional organisations in the field of computing refuse to engage with the idea that a software engineer should be expected to learn the software engineering body of knowledge precisely because it doesn’t have anything to say about how to build safety-critical software.

Now what is safety-critical software? Or, if you can’t answer that, what software isn’t safety-critical? The book Moral AI tells the story of a self-driving car that collided with, and killed, a pedestrian who was pushing a bicycle across the road. The driver (for lack of an agreed and accurate word) reports streaming an episode of The Voice in the background while checking in to her employer’s work chat channels on the car’s central console. Is the car’s autonomous driving system safety-critical in this context? What about the built-in collision avoidance system, which the employer had disabled in this vehicle? How about the streaming software, or the chat application, or the operating systems on which they run? All of these contributed to a fatality; what makes any of them safety-critical or not?

The second thing is that the claim in the abstract is about learning outcomes, skill formation, and efficiency gains. We need to go into reading this paper keeping those terms in mind, and asking ourselves whether this is actually what the authors discuss. Because we care about what they did and what they found, and aren’t so worried about the academic context in which they want to present this work, let’s skip straight to section 4, the method.

What did they do?

Unfortunately, we don’t learn a lot about the method from their method section, certainly not enough to reproduce their results. They tell us that they use “an online interview platform with an AI chat interface”, but not which one. The UI of that platform might be an important factor in people’s cognition of the code (does it offer syntax highlighting, for example? Does it offer a REPL, or a debugger? Can it run tests?) or their use of AI (does it make in-place code edits?).

In fact, when we read on, we find that in a pilot study they found that such a platform (they call it P1) was unsuitable and that they switched to another, P2. Choosing a deliberately uncharitable reading, P1 is probably Anthropic’s regular platform for interviewing engineering candidates, and management didn’t want their employees saying it has problems, because Silicon Valley culture is uniquely intransigent when it comes to critiquing its interviewing practices (if you think the interview system is broken, you’re saying there’s a chance that I shouldn’t have been given my job, and that’s too horrible to contemplate). Whether that’s true or not, we’re left not knowing what a participant actually saw.

The interviewing platform has “AI”, and the authors tell us the model (GPT-4o); this piece of information leads me to put more weight on my hypothesis about the interview platform name. It subtly reframes the paper from “AI has a negative impact on skills acquisition” to “our competitor’s product has a negative impact on skills acquisition”; why mention this one product if you took a principled position on anonymising product names? Unfortunately that’s all we get. “The model is prompted to be an intelligent coding assistant.” What does that mean? Prompted by whom? Did the experimenters control the system prompt? What was the system prompt? Was the model capable of tool use? What tools were available? Could participants modify the system prompt?

So now, to what we do know: 52 people (said in the prose to be split into 26 in the “no-AI” control group and 26 in the “AI access” treatment group; table 1 doesn’t quite add up that way) were given a 10-minute coding challenge, then a 35-minute task to make use of a particular Python library, either with or without AI assistance. Finally, they had 25 minutes to answer questions related to the task: an appendix lists examples of the kinds of questions used, but not the actual question script. This is another factor that negatively impacts replicability.

What did they find?

There’s no significant difference in task completion time between the two groups (people who used AI, and people who didn’t use AI). That is, while the mean task completion time is slightly lower for the AI group, the spread of completion times is such that this doesn’t indicate a meaningful effect.

Overall, people who used AI did less well on the test that they took immediately after completing the task, and this is a meaningful effect.
However, looking at their more detailed results (figure 7), it seems that among developers with less Python experience (1-3 years), the task completion time was vastly improved by AI access, and the outcome on the quiz was not significantly different.

Remember the sentence in the abstract was “Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library.” A different observation is “Participants with only a few years of Python experience showed significant productivity improvements, at no cost to learning the library”.

But does the quiz provide evidence of “having learned the library”? It’s a near-immediate test to recall knowledge about a task that the participants had just completed. What would the researchers have found if they waited four hours, or one day, or one week, to give the test? What would they have found if they set “learning the library” as an explicit task, and gave people time (with or without access to AI) to study? Would it make a difference if this study time was undertaken before participants undertook the coding task, or after? The authors find that some participants used significant time asking the AI assistant questions about the problem. In this way, they measure the total time taken to learn and solve the coding problem, in a situation where you’ve been given a total of 35 minutes for both.

The authors performed some qualitative analysis of their data. They find that people in the no-AI condition encounter more errors (syntax or API misuse) than people who use AI: this should be an interesting result, and a challenge to anyone who prejudges AI-generated code as “slop”. In this situation, it would seem that human-generated code is sloppier (at least in the first few minutes after it’s created).

They identify six AI interaction patterns among their treatment group, and find that people who used three of the patterns achieved better results than the control group on the quiz outcome (though we can’t comment on significance, because the paper doesn’t report statistics at this level of detail), without impact on “productivity”. As someone who has hitched their wagon to the horse of intentional use of AI to improve software engineering skills, this should give me the warm fuzzies. Given the questions about the study’s validity raised above, though, I don’t know that they’ve necessarily demonstrated such improvement.

At this point I have another confounding factor to add to these results: the researchers questioned participants on their programming experience, but not on their experience of using AI assistants (beyond recruiting people with non-zero experience). Does the adoption of these patterns correlate with more experience of using AI? Do people with more experience of using AI get more productivity out of it? We can’t tell.

And we also can’t say anything about the skill of using an AI assistant itself. Participants are asked about their understanding of the Python library, and the authors translate their performance in answering these questions into a measure of the “skill acquisition” involved in using the library. Is that the skill they exercised? Do the quiz answers tell us anything about that skill? If participants were asked a week later to complete a related task, would their performance correlate with the quiz results? Is using the Python library even a useful skill to have?

The authors observed that one pattern performed far worse on both task-completion time and quiz responses than all the others, and this was “Iterative AI Debugging”: verifying or debugging code using the AI. This result isn’t surprising, because the pattern represents using the model to evaluate the logic embodied in the code, and language models don’t evaluate logic. They’re best suited to what used to be called “caveman debugging” where you use print statements to turn dynamic behaviour into a sequence of text messages—because the foodstuff of the model is sequences of text. They don’t evaluate the internal state and control flow of software, so asking them to make changes based on understanding that internal state or control flow is unlikely to succeed. However, given the small amount of data on this debugging pattern, this is really a plausible conjecture worthy of follow up, not a finding.

This preprint claims that using AI assistants to perform a task harms task-specific skill acquisition without significantly improving completion time. What it shows is that using AI assistants to perform a task leads to a broad distribution of ability to immediately answer questions related to the completion of the task, with an overall slight negative effect, without significantly improving completion time. The relation of the acquired knowledge to knowledge retention, or to skill, remains unexplored.

Posted in AI

Creating “sub-agents” with Mistral Vibe

The vibe coding assistant doesn’t have the same idea of sub-agents that Claude Code does, but you can create them yourself—more or less—from the pieces it supplies. [UPDATE: vibe 2.0 supports subagents directly.]

Write the prompt for the sub-agent in a markdown file, and save it to ~/.vibe/prompts/<agent-name>.md. For example:

# Test suite completer
You are an expert software tester. You help the user create a complete and valuable test suite by analyzing their software and their tests, identifying tests that can be added, and constructing those tests.
Use test design principles, including equivalence partitioning and boundary value analysis, to identify gaps in test coverage. Review the project's documentation, including comment docs and help strings, to determine the software's intended behavior. Design a suite of tests that correctly verifies the behavior, then investigate the existing test code to determine whether all of the cases you designed are covered. Add the tests you identify as missing.
## Workflow
1. Read the user's prompt to understand the scope of your tests.
2. Discover documentation and code comments that describe the intended behavior of the system under test.
3. Design a suite of tests that exercise the system's intended behavior, and that pass if the system behaves as expected and fail otherwise.
4. Search the existing test code for tests that cover the behavior you identify.
5. Create tests that your analysis indicates are necessary, but that aren't in the existing test suite.
6. Report to the user the tests you created so they can review and run the tests.

Write a configuration for an agent that uses this system prompt, and save it to ~/.vibe/agents/<agent-name>.toml. For example:

# A model alias defined in ~/.vibe/config.toml; omit to use the default model
active_model = "devstral2-local"
# Must match the system prompt's filename, without the .md extension
system_prompt_id = "test-suite-completer"
[tools.read_file]
permission = "always"
[tools.write_file]
permission = "always"
[tools.search_replace]
permission = "always"

The active_model needs to be a model that you define in ~/.vibe/config.toml, or you can omit it to use the default model. The system_prompt_id needs to match the filename you give the system prompt file, without the .md extension.

You can use this agent by passing the --agent option to vibe. For example, I use the following shell script, and create a symbolic link to it with the same name as the agent I want to use:

#!/bin/sh
# Invoke vibe with the agent named after this script (or its symlink)
if [ $# -ne 1 ]; then
    echo "Usage: $0 <prompt>"
    exit 1
fi
AGENT=$(basename "$0")
vibe --agent "$AGENT" --prompt "$1"

You can now use this agent directly at the command line, or tell vibe about the script so that it invokes your agent as a sub-agent.

Posted in AI

Announcing Chiron Codex, a community of software centaurs

Software engineers don’t need to outsource our agency to coding agents. We don’t need to give up reading the code, or understanding the problems. We can use AI tools to augment our own capabilities, to improve our engineering knowledge and skills. To become software centaurs.

Chiron Codex is an initiative to do just that. In the short term, I’m creating a pattern language of AI-augmented software engineering and, on Patreon and YouTube, a community of people who want to use AI to become better software engineers. Longer term, we’ll explore ways to improve at all aspects of the software engineering lifecycle; becoming software generalists who use AI to complement our expertise, and our expertise to direct the AI tools.

Join us, and please consider supporting Chiron Codex by subscribing to the Archaeopteryx (super-early bird; so early birds haven’t even evolved) tier on Patreon! Here’s a video explaining the benefits.

Posted in AI

Configuring your computer for local inference with a generative AI coding assistant

You can use multiple tools to download, host, and interact with large language models (LLMs) for generative tasks, including coding assistants. This post describes the one that I tried that has been the most successful. Even if you follow the approach below and it works well for you, I recommend trying different combinations of LLM and coding assistant so that you can find the setup that’s most ergonomic.

Choose hardware

You need a computer with either sufficient GPU or dedicated neural-processing capacity to run an LLM, and enough RAM to hold gigabytes of parameters in memory while also running your IDE, the software under development, and other applications. As an approximate rule, allow 1GB for every billion parameters in the model.

I chose a Mac Studio with an M3 Ultra and 256GB of RAM. This computer uses roughly half of its memory to host the 123-billion-parameter Devstral 2 model. A computer with 32GB of RAM can run a capable small model, for example Devstral Small 2; in this walkthrough I’ll show how to set up that model using Mistral Vibe as the coding assistant.

Note that once you have the model working locally, you can share it on your local network (or, using a VPN or other secure channel, over the internet) and access it from your other computers. Only one computer on your network needs to be capable of hosting the LLM you choose in order to use local inference from any computer on that network.

Install LM Studio

Visit LM Studio and click the download button. Follow the installation process for your operating system; in macOS, you download a DMG that you open, and drag the app it contains into your Applications folder.

Download the model

Open LM Studio, and open the Model Search view by clicking on the magnifying glass. In macOS, check the MLX box to use the more efficient MLX format, and leave GGUF unchecked. Search for “Devstral Small 2 2512”, and click Download to download its weights and other configuration data. The second number (2512) refers to the release date of the model—in this case, December 2025.

Load the model and test it

When your model is downloaded, switch to the Chats view in LM Studio. In the window toolbar, click “Select a model to load” and choose the model you just downloaded. Optionally, toggle “Manually choose model load parameters” and configure settings. I typically alter the context size, as the default context size is 4096 tokens, which optimises for inference speed and a small memory footprint over a large “working set”. Click “Load Model” to tell LM Studio to serve the model. You can also tell LM Studio to use the model’s maximum context size as the default whenever it loads a model, in the app’s settings.

When LM Studio loads your chosen model, it opens a new chat with the model. Type a prompt into this chat to validate that the model is working, and has enough resources for inference tasks.

Download, configure, and test a coding assistant

Coding assistants typically expect to take an API key, and communicate with a model hosted in the cloud. To use a local LLM, you need to configure the assistant.

Follow the instructions in the Vibe studio README to install the tool. In Terminal, run mkdir ~/.vibe. Use your text editor to save the following content in a file called ~/.vibe/config.toml:

active_model = "devstral2-small-local"

[[providers]]
name = "lmstudio"
api_base = "http://localhost:1234/v1" # LM Studio's local server, which speaks the OpenAI API
api_key = "LM_STUDIO_API_KEY" # LM Studio doesn't use this value
api_style = "openai"
backend = "generic"

[[models]]
name = "mistralai/devstral-small-2-2512"
provider = "lmstudio"
alias = "devstral2-small-local" # the name referenced by active_model above
temperature = 0.2
input_price = 0.0
output_price = 0.0

Now test the assistant by running vibe in Terminal, and typing a prompt into the assistant.
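
If you want to check the LM Studio endpoint independently of vibe, any OpenAI-compatible client will do. Here’s a minimal sketch using the openai Python package (an assumption on my part; install it with pip install openai if you want to try this), pointed at the same api_base and model name as the config above:

from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key is required but unused.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="mistralai/devstral-small-2-2512",  # the model name from config.toml
    messages=[{"role": "user", "content": "Write a one-line hello world in Python."}],
)
print(response.choices[0].message.content)

If that prints a sensible completion, the model server is healthy, and any problems you hit later are in the assistant’s configuration rather than in the model hosting.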

Further learning

I’ve recently started Chiron Codex, an initiative to create software engineering centaurs by augmenting human knowledge of the software craft with AI assistance. You can find out more, and support the project, over on Patreon. Thank you very much for your support!

Posted in whatevs

Management is the wrong analogy for LLM augmentation

A common meme at the moment in AI-augmented coding circles is “we are all managers now”, with people expressing the idea that alongside actual programming, programmers now manage their team of agents. This is a poor analogy, in both directions. Treating an interaction with an AI agent like a manager-report interaction would lead to a poor experience for using the agent. Treating an interaction with a direct report like you’re using an AI tool would likely result in a visit from your HR representative.

In my experience of being managed and of being a manager in software companies, the good managers I’ve had, and the kind of manager I aspire to be, are the ones Camille Fournier describes in The Manager’s Path:

Managers who care about you as a person, and who actively work to help you grow in your career. Managers who teach you important skills and give you valuable feedback. Managers who help you navigate difficult situations, who help you figure out what you need to learn. Managers who want you to take their job someday. And most importantly, managers who help you understand what is important to focus on, and enable you to have that focus.

An LLM doesn’t have a job, a career path, or growth goals, and it doesn’t learn from your interactions. You can’t really tell it what’s important to focus on; you can just try to avoid showing it things you don’t want it to focus on. An LLM never gets into a difficult situation; the customer is always “absolutely right!”

Treating an LLM like a direct report can only lead to frustration. It isn’t a person who wants to succeed at its job, to learn and grow in its role, or to become more capable. Indeed, it can’t do any of those things. It’s a tool. A tool that happens to have an interface that superficially resembles conversation.

And that means that the correct way to treat an AI agent or coding interface is like a tool: it’s a text editor with a chat-like interface, a nondeterministic build script, or a static analysis tool. You’re looking for the correct combination of words and symbols to feed in to make the tool produce the output you want.

Treating a person who reports to you in that way would be unsatisfying and ultimately problematic. You don’t find different ways to express your problem statement until they solve it the way you would have solved it. You don’t give them detailed rules files with increasingly desperate punctuation around the parts it’s ## MANDATORY! that they follow. You find a way to work together, to teach each other, and to support each other.

If you’re really looking for an analogy with human-human interactions, then working with an outsource agency is slightly more accurate (particularly one located in a different place with a different culture and expectations, where you have to be more careful about communication because you can’t rely on shared norms and tacit knowledge being equivalent). You do, in such cases, work on a clearly-scoped task or project, with written statements of work and clear feedback points. You still expect it to get better and easier over time, for the agency’s people to learn and adapt in ways that LLM-based tools don’t, and to show initiative when faced with unstated problems in ways that LLM-based tools can’t. And the agency’s people still expect to be treated as peers and experts, helping you out by doing the work that you don’t have the capacity for. It’s better, but not great, as analogies go.

Unfortunately, the best analogy we have for “precisely expressing problem statements in such a way that a computer generates the expected solution” is exactly the kind of thing that many people in the LLM world would like to claim isn’t happening.

Posted in AI, tool-support

Is spec-driven development the end of Agile software development?

A claim that I’ve seen on software social media is that spec-driven development is evidence that agile was a dark path, poorly chosen. The argument goes that Agile software development is about eschewing detailed designs and specifications, in favour of experimentation and feedback. Spec-driven development shows that the way to unlock maximal productivity in augmented development (and therefore in development overall, as LLM agents can type faster than programmers can) is by writing detailed specifications for the code generator to follow. Therefore detailed specifications are valuable after all, so Agile was wrong to do away with them.

Let’s take a look at what’s going on in—and behind—this argument.

Agile is alleged to eschew detailed specification.

It certainly looks that way, at a quick glance. As Brian Marick noted, one member of the Agile Alliance didn’t care what was in the manifesto so long as there was no way IBM or Rational could agree with it. The people who came together for that skiing holiday in Utah were coaches and practitioners of ‘lightweight methodologies’, which favoured early delivery of working software to the customer over detailed product requirements documents and design specifications that the customer reviews and signs off before they ever see any code—indeed, before any developer is allowed to begin coding.

The reason for this isn’t to create unspecified software. It’s to discover the specification through iterative feedback. If you compare the output of a team that follows Agile software development with one that follows a then-prevailing methodology, such as the Rational Unified Process (RUP) that irked one of the alliance members, you’d probably learn that the agile team actually has a much more detailed specification.

In addition, that spec is machine-readable, executable, and generates errors whenever the software falls out of conformance. The specification these teams produce through their emergent process is the test suite. If every software change follows the creation of a failing test, then all of the software—every feature, every bug fix—is specified in a detailed document.
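
To make that concrete, here’s a sketch of what one fragment of such a specification might look like, as a test for an invented discount rule (the function name and the rule are made up for illustration):

def discounted_price(price_pence: int, loyalty_years: int) -> int:
    """Customers with three or more loyalty years get 10% off, rounded down."""
    if loyalty_years >= 3:
        return price_pence * 9 // 10
    return price_pence

def test_loyal_customers_get_ten_percent_off():
    # Written (and failing) before the behaviour existed, this test is the spec:
    # it states the required behaviour and flags any regression from it.
    assert discounted_price(10_000, loyalty_years=3) == 9_000

def test_new_customers_pay_full_price():
    assert discounted_price(10_000, loyalty_years=0) == 10_000

Multiply that by every feature and every bug fix, and you have the detailed, executable specification described above.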

Three evident differences between the ‘specs’ in something like RUP, and something agile:

  1. The agile spec’s format is closer to working software, which is what the customers ultimately value.
  2. The agile spec has no ambiguity: the software meets the spec and the test passes; or it doesn’t, and the test fails.
  3. The agile spec evolves throughout the life of the software endeavour, capturing elements of a dialectic between creators, customers, and software. Meanwhile, according to figure 1.5 in The Unified Software Development Process, ‘rational’ developers finish elaboration at an early point in the project and move on to filling in the details.

And one big difference in the purpose of the tests: RUP creates tests for verification (did we build it right in the implementation phase?), while agile teams create tests that also supply validation (do we understand what we’re being asked to build?).

Aside: the “specs” in spec-driven development occupy the same conceptual space as tests created to drive out design.

A developer listens to a description of behaviour that their software should embody, and writes a test to drive out the details and confirm they understand the situation. When happy that the test represents a description of the code they need to create, they write the code.

A developer listens to a description of behaviour that their software should embody, and writes a document to drive out the details and confirm they understand the situation. When happy that the document represents a description of the code they need to create, they create the code.

These are the same statement, made at different levels of abstraction with respect to the tools the developer uses. In other words, the people are doing the same thing, using different tools. If you “have come to value individuals and interactions over processes and tools”, then you will probably think that there is some value in the tools; but not as much as there is value in their application. Speaking of which…

The agile manifesto says that the alliance members value specifications.

The template statement in the manifesto is “[blah blah blah] we have come to value x over y. That is, while there is value in the ys, we value the xs more.”

The instance that’s immediately relevant to the spec-driven development argument is x = working software; y = comprehensive documentation, where detailed specification is an example of comprehensive documentation (particularly in the prolix style adopted by many of the models). Performing the substitution, detailed specification is indeed valuable, but not as an end in itself: it is (or can be) useful in pursuit of delivering valuable software.

Agile software development isn’t about not doing things; it’s about understanding why you do anything, and being ready to try something else in pursuit of your goal. With the constraint that your goal needs to be “early and continuous delivery of valuable software”.

Returning to the aside, above, x = individuals and interactions and y = processes and tools. Spec-driven development is a process, generative AI is a tool; the point isn’t to use or avoid either, they can be valuable in the pursuit of working software.

Agile software development was about who makes software and how.

The comparison between RUP and lightweight methodologies made above was particularly apposite at the moment the manifesto for Agile software development was created; a single moment that highlighted a tension in a dichotomy. It isn’t the opposition of documentation and software, or change and plans. It’s the opposition of practitioner-led versus managerial software development.

The summary of the principles behind the manifesto is approximately ‘get people who know what they’re doing, let them talk to the customer and give the customer software, then get out of their way’. The list has an emphasis on technical excellence, simplicity, customer collaboration, and—crucially—self-organisation, all seconded to the ‘highest priority’ of customer satisfaction through valuable software, with ‘working software’ as the primary measure of progress.

In other words, we promise to make software, if you promise to let us.

The prevailing, heavyweight processes were predicated on a breakdown of this compact. Managers don’t trust developers not to gold-plate software for its own sake, continually polishing but never shipping. Therefore, it’s up to managers to work out what needs to be done and to keep developers’ noses pressed to the code face until they do it.

Bring that up to date, and every software organisation now pays lip service to agile software development, and yet the opposition between practitioner-led and managerial software development still exists. When developers of the 2020s complain about agile, they typically complain about facets of managerial agile: valueless ceremonial meetings; OKRs as the primary measure of progress; and perpetual sprinting with no warm-ups, cool-downs, or rests.

All of which is to say that the story of practitioner-driven software development remains partially told, and that whether spec-driven development contributes to its continuation is only a small part of the question; a question that remains open-ended.

Posted in agile, AI

On the value of old principles

People using AI coding assistants typically wrestle with three problems (assuming they know what they’re trying to get the model to do, and that that’s the correct thing to try to get it to do):

  • Prompt. How to word the instructions in a way that yields the desired outcome, especially considering the butterfly effect that small changes in wording can lead to large changes in result.
  • Context. The models work on a tokenised representation of information, and can only deal with a finite number of tokens at once.
  • Attention. The more things a model is instructed to attend to, the less important is each thing’s contribution to the generated output stream. This tends to follow a U-shaped distribution, with the beginning and end of the input stream being more important than the middle.

(It’s important to bear in mind during this discussion that all of the above, and most of the below, is a huge mess of analogies, mostly introduced by AI researchers to make their research sound like intelligence, and tools vendors to make their models sound like they do things. A prompt isn’t really “instructions”, models don’t really “pay attention” to anything, and you can’t get a model to “do” anything other than generate tokens.)

Considering particularly the context and attention problems, a large part of the challenge people face is dividing large amounts of information available about their problem into small amounts that are relevant to the immediate task, such that the model generates a useful response that neither fails because relevant information was left out, nor fails because too much irrelevant information was left in.

Well, it turns out human software developers suffer from three analogous problems too: failing to interpret guidance correctly; not being able to keep lots of details in working memory at once; and not applying all of the different rules that are relevant at one time. As such, software design is full of principles that are designed to limit the spread of information, and that provide value whether applied for the benefit of a human developer or a code-generation model.

Almost the entire point of the main software-design paradigms is information hiding or encapsulation. If you’re working on one module, or one object, or one function, you should only need to know the internal details of that module, object or function. You should only need to know the external interface of collaborating modules, objects, or functions.

Consider the Law of Demeter, which says approximately “don’t talk to your collaborators’ collaborators”. That means your context never needs to grow past immediate collaborators.

Consider the Interface Segregation Principle, which says approximately “ask not what a type can do, ask what a type can do for you”. That means you never need to attend to all the unrelated facilities a type offers.

Consider the Open-Closed Principle, which says approximately “when it’s done, it’s done”. That means you never need concern yourself with whether you need to change that other type.

Consider the Ports and Adapters architecture, which says approximately “you’re either looking at a domain object or a technology integration, never both”. That means you either need to know how your implementation technology works or you need to know how your business problem works, but you don’t need details of both at the same time.
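
To make the flavour concrete, here’s a minimal sketch of the Law of Demeter in Python (the order-and-warehouse domain is invented for illustration): the object asks its immediate collaborator to do the work, instead of reaching through it into its internals.

class Warehouse:
    """A collaborator whose internals stay private."""

    def __init__(self, stock: dict[str, int]):
        self._stock = stock

    def reserve(self, sku: str, quantity: int) -> bool:
        """The only interface an Order needs to know about."""
        if self._stock.get(sku, 0) >= quantity:
            self._stock[sku] -= quantity
            return True
        return False


class Order:
    def __init__(self, warehouse: Warehouse):
        self._warehouse = warehouse

    def add_line(self, sku: str, quantity: int) -> bool:
        # Demeter-friendly: talk only to the immediate collaborator...
        return self._warehouse.reserve(sku, quantity)
        # ...rather than reaching through it, e.g. self._warehouse._stock[sku],
        # which would drag the warehouse's internal representation into this
        # object's context.

Whether the reader is a person or a model, understanding Order requires only Warehouse’s public method, never its stock representation.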

All of these principles help limit context and attention, which is beneficial when code-generating models have limited context and attention. Following the principles means that however large your system gets, it never gets “too big for the model” because the model doesn’t need to read the whole project.

Even were the models to scale to the point where a whole, ridiculously large software project fits in context, and even were they to pay attention to every scrap of information in that context, these principles would still help. Because they also help limit the context and attention that we humans need to spend, meaning we can still understand what’s going on.

And for the foreseeable, we still need to understand what’s going on.

Posted in AI, design, OOP

Vibe coding and BASIC

In Vibe Coding: What is it Good For? Absolutely Nothing (Sorry, Linus), The Register makes a comparison between vibe coding today and the BASIC programming of the first generation of home microcomputers:

In one respect, though, vibe coding does have an attractive attribute that is difficult to find elsewhere. For the first time since instant-on home computers that fell straight into a BASIC interpreter, it’s possible to make things happen with a bit of typing as a very naive user.

[Image: the back cover of issue 7 of Input magazine, showing that PEEK and POKE are coming in issue 8.]

A big difference between the two scenarios is that on an early micro, you had to use BASIC to get anything else done, in many situations. The computer I used was a Dragon 32 and unless you had a (very expensive) game or package that came on a ROM cartridge, even loading software from cassette required a BASIC command.

Actually it was one of two BASIC commands: you typed CLOAD to load a BASIC program, or CLOADM to load a machine-language program. Either way, the default behaviour of the computer was to load its BASIC interpreter from ROM and wait for input, that input being in the form of lines of BASIC.

The context of those two commands indicates a big similarity between BASIC and vibe coding: to get particularly far, you needed some knowledge of the workings of the rest of the computer. In this case, you needed to know whether your cassette tape contained a BASIC program or a machine-language program, but that isn’t the most egregious example.

As I said, the computer I used at the time was a Dragon 32, which was a kind of not-clone of the Tandy Color Computer. Let’s say that I wanted to write a game, which is entirely plausible because I both wanted to, and did, write many games. How do I read the joystick direction and fire button status in BASIC, so I can add joystick control to my games?

Direction is relatively easy. There are two potentiometer-based joysticks, right and left, and each has an x and a y axis. The JOYSTK function takes an index which identifies the axis to read, and returns a value from 0 to 63 indicating the extent of travel. For example, JOYSTK(0) is the x axis of the right joystick.

To get the fire button status, I use this operation: the Peripheral Interface Adapter’s side A data register is mapped to location &HFF00 in memory. The right joystick’s fire button unsets bit 0 when it’s pressed, and sets it when it isn’t. The left joystick’s fire button unsets bit 1 when it’s pressed, and sets it when it isn’t. For example, P = PEEK(&HFF00) : F = NOT (P AND 1) sets F to 1 if the right fire button is pressed.

Yes, I’m still technically BASIC coding, but I’ve learned a lot about the architecture of the machine and applied advanced mathematical concepts (I was single-digit years old) to improve my BASIC-coded software.

Much more importantly, I’ve become excited to understand this stuff and apply it to the programs I write, and I’m enthusiastic to share those programs and find other people who want to share their programs. There’s an inert plastic box plugged into an old TV in the otherwise-unused middle room of the house, and I can make it do what I want.

The BASIC-haters pooh-poohed that notion: OK you can make your little toys, but real software is written in machine code. Citation needed I’m sure, but I suspect that the advent of Visual BASIC meant that far more real software was being written in BASIC than in machine code even by the 1990s, a decade in which that first generation of micros was redundant but not quite obsolete (the last Commodore 64 was sold in April 1994, because Commodore went bankrupt).

However, the people who learned BASIC just picked up other tools and continued the journey they had been inspired to set out on. I live in the UK and many of the professional programmers I meet, when they’re between about ten years older than me and three years younger than me, are in that BASIC generation. We typed programs on our micros, we learned how the computers worked, and then we transferred that understanding and that exploration to other systems and other programming languages. We watched TRON, and learned to fight for the users: being scared by computers because they refuse to open the pod bay doors was for our parents.

If vibe coding gives this generation that same sense of wonder and empowerment, of being able to control a device that has hitherto only done what other people charge you to do, if it starts that same journey of learning and applying skills, and of sharing that knowledge, then it really doesn’t matter whether you think it’s OK for real software. It really doesn’t matter whether it ends up being a great tool or not.

Posted in AI, history

Essence and accident in language model-assisted coding

In 1986, Fred Brooks posited that there was “no silver bullet” in software engineering—no tool or process that would yield an order-of-magnitude improvement in productivity. He based this assertion on the division of complexity into that which is essential to the problem being solved, and that which is an accident of the way in which we solve the problem.

In fact, he considered artificial intelligence of two types: AI-1 is “the use of computers to solve problems that previously could only be solved by applying human intelligence” (here Brooks quotes David Parnas), which to Brooks means things like speech and image recognition; and AI-2 is “The use of a specific set of programming techniques [known] as heuristic or rule-based programming” (Parnas again), which to Brooks means expert systems.

He considers that AI-1 isn’t a useful definition, and isn’t a way of tackling complexity, because results typically don’t transfer between domains. AI-2 contains some of the features we would recognize from today’s programming assistants—finding patterns in large databases of how software has been made, and drawing inferences about how software should be made. The specific implementation technology is very different, but while Brooks sees that such a system can empower an inexperienced programmer with the experience of multiple expert programmers—“no small contribution”—it doesn’t itself tackle the complexity in the programming problem.

He also writes about “automatic programming” systems, which he defines as “the generation of a program for solving a problem from a statement of the problem specifications” and which sounds very much like the vibe coding application of language model-based coding tools. He (writing in 1986, remember) couldn’t see how a generalization of automatic programming could occur, but now we can! So how do they fare?

Accidental complexity

Coding assistants generate the same code that programmers generate, and from that perspective they don’t reduce accidental complexity in the solution. In fact, a cynical take would be to say that they increase accidental complexity, by adding prompt/context engineering to the collection of challenges in specifying a program. That perspective assumes that the prompt is part of the program source, but the generated output is still inspectable and modifiable, so it’s not clearly a valid argument. However, these tools do supply the “no small contribution” of letting any one of us lean on the expertise of all of us.

In general, a programming assistant won’t address accidental complexity until it doesn’t generate source code and just generates an output binary instead. Then someone can fairly compare the complexity of generating a solution by prompting with generating a solution by coding; but they also have to ask whether they have validation tools that are up to the task of evaluating a program using only the executable.

Or the tools can skip the program altogether, and just get the model to do whatever tasks people were previously specifying programs for. Then the accidental complexity has nothing to do with programming at all, and everything to do with language models.

Essential complexity

Considering any problem that we might want to write software for, unless the problem statement itself involves a language model then the language model is entirely unrelated to the problem’s essential complexity. For example, “predict the weather for the next week” hides a lot of assumptions and questions, none of which include language models.

That said, these tools do make it very easy and fast to uncover essential complexity, and typically in the cursed-monkey-paw “that’s not what I meant” way that’s been the bane of software engineering since its inception. This is a good thing.

You type in your prompt, the machine tells you how absolutely right you are, generates some code, you run it—and it does entirely the wrong thing. You realize that you needed to explain that things work in this way, not that way, write some instructions, generate other code…and it does mostly the wrong thing. Progress!

Faster progress than the old thing of specifying all the requirements, designing to the spec, implementing to the design, then discovering that the requirements were ambiguous and going back to the drawing board. Faster, probably, even than getting the first idea of the requirements from the customer, building a prototype, and coming back in two weeks to find out what they think. Whether it’s writing one to throw away, or iteratively collaborating on a design[*], that at least can be much faster now.

[*] Though note that the Spec-Driven Development school is following the path that Brooks did predict for automatic programming (via Parnas again): “a euphemism for programming with a higher-level language than was presently available to the programmer”.

Posted in AI, software-engineering, tool-support

Tony Hoare and negative space

The Primeagen calls it Negative-Space Programming: using assertions to cut off the space of possible programs, leaving only the ones you believe are possible given your knowledge of a program’s state at a point.

Tony Hoare just called it “logic”, which is why we now call it Hoare Logic. OK, no he didn’t. He called it An axiomatic basis for computer programming, but that doesn’t make the joke work. The point is, this is how people were already thinking about program design in 1969 (and, honestly, for a long time before). I’m glad the Primeagen has discovered it, I just don’t think it needs a new name.

Imagine you know the program’s state (or, the relevant attributes of that state) at a point in your program, and you can express that as a predicate—a function of the state that returns a Boolean value—P. In other words, you can assert that P is true at this point. The next thing the computer does is to run the next instruction, C. You know the effect that C has on the environment, so you can assert that it leaves the environment in a new state, expressed by a new predicate, Q.

Tony Hoare writes this as a Hoare Triple, {P}C{Q}. We can write it out as a formulaic sentence:

Given P, When C, Then Q

Given-when-then looks exactly like a description of a test, because it is; although our predicates might not be specific states for the program to be in as in a specific unit test, they might describe sets of states for the program to be in, as in a property-based test.
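
As a minimal sketch (in Python, with an invented command and predicates), a triple {P}C{Q} reads directly as a given-when-then test:

def command(xs: list[int], x: int) -> None:
    """C: the next instruction the computer runs."""
    xs.append(x)

def test_append_triple():
    xs = [1, 2]
    # Given P: the state satisfies the precondition predicate.
    assert len(xs) == 2
    # When C: run the instruction.
    command(xs, 3)
    # Then Q: the new state satisfies the postcondition predicate.
    assert len(xs) == 3 and xs[-1] == 3

A property-based version would generate many starting states satisfying P, rather than pinning down this single one.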

We can also use the fact that the computer runs program statements in sequence to compose these triples. Given P, when C, then Q; given Q, when D, then R; given R and so on. Now, if we arrange our program so that it only runs C when P is true, and it always runs D next, and so on, we can use the composition rule to say something about the compound statement C;D;etc. If we wrap that compound statement up in a function f(), then we get to say this:

If the caller of f() makes sure that the precondition for calling the function is true before making the call, then f() guarantees that the postcondition is true after the call returns.

This combination of responsibilities and guarantees is a contract. If you start with the contract and work to design an implementation that satisfies the contract, then you are designing by contract. You’re following a discipline of programming, as Dijkstra put it.

Oh, speaking of Dijkstra. We said that you need to know that the computer always runs this part of our program in the sequential way, with the stated precondition; that means it isn’t allowed to jump partway into it with some other, different initial state. If we allowed that, then we’d have a much harder time constructing our P at a point in the program, because it would be whatever we think it is if the program follows its normal path, or whatever it is if the program goes to this point from over here, or whatever it is if the program goes to this point from over there, and so on. And our Q becomes whatever we think it is if the computer runs C after the normal path, or whatever it is if the computer goes to this point from over here then runs C, and so on.

Thinking about all of that additional complexity, and persevering with designing a correct program, could be considered harmful. To avoid the harm, we design programs so that the computer can’t go to arbitrary points from wherever the programmer wants. All to support logically-designed programming, or “negative-space programming” as I suppose we’re meant to call it now.

Posted in history, software-engineering