Opinionated Read: How AI Impacts Skill Formation


The abstract to this preprint, by two authors both associated with Anthropic, makes the claim “We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation – particularly in safety-critical domains.”

The first thing to appreciate is that the idea of “safety-critical domains” does a lot of heavy lifting in software professionalism. On the one hand, engineers say that while (some intervention that is validated in controlled trials or in experience reports, but isn’t the way engineers like to work) is clearly something those safety-critical folks should concern themselves with, it isn’t relevant to (whatever domain happens to include the engineers’ own work). On the other hand, professional organisations in the field of computing refuse to engage with the idea that a software engineer should be expected to learn the software engineering body of knowledge, precisely because that body of knowledge has nothing to say about how to build safety-critical software.

Now, what is safety-critical software? Or, if you can’t answer that, what software isn’t safety-critical? The book Moral AI tells the story of a self-driving car that collided with, and killed, a pedestrian who was pushing a bicycle across the road. The driver (for lack of an agreed and accurate word) reports streaming an episode of The Voice in the background while checking in to her employer’s work chat channels on the car’s central console. Is the car’s autonomous driving system safety-critical in this context? What about the built-in collision avoidance system, which the employer had disabled in this vehicle? How about the streaming software, or the chat application, or the operating systems on which they run? All of these contributed to a fatality; what makes any of them safety-critical or not?

The second thing is that the claim in the abstract is about learning outcomes, skill formation, and efficiency gains. We need to read this paper with those terms in mind, asking ourselves whether they are actually what the authors discuss. Because we care about what they did and what they found, and aren’t so worried about the academic context in which they want to present the work, let’s skip straight to section 4, the method.

What did they do?

Unfortunately, we don’t learn a lot about the method from their method section, certainly not enough to reproduce their results. They tell us that they use “an online interview platform with an AI chat interface”, but not which one. The UI of that platform might be an important factor in people’s cognition of the code (does it offer syntax highlighting, for example? Does it offer a REPL, or a debugger? Can it run tests?) or their use of AI (does it make in-place code edits?).

In fact, when we read on, we find that one such platform (they call it P1) proved unsuitable in a pilot study, and that they switched to another, P2. Choosing a deliberately uncharitable reading: P1 is probably Anthropic’s regular platform for interviewing engineering candidates, and management didn’t want their employees saying it has problems, because Silicon Valley culture is uniquely intransigent when it comes to critiquing its interviewing practices (if you think the interview system is broken, you’re saying there’s a chance that I shouldn’t have been given my job, and that’s too horrible to contemplate). Whether that’s true or not, we’re left not knowing what a participant actually saw.

The interviewing platform has “AI”, and the authors tell us the model (GPT-4o); this piece of information leads me to put more weight on my hypothesis about the identity of the interview platform. It subtly reframes the paper from “AI has a negative impact on skills acquisition” to “our competitor’s product has a negative impact on skills acquisition”; why name this one product if you took a principled position on anonymising product names? Unfortunately, that’s all we get. “The model is prompted to be an intelligent coding assistant.” What does that mean? Prompted by whom? Did the experimenters control the system prompt? What was the system prompt? Was the model capable of tool use? What tools were available? Could participants modify the system prompt?

So, to what we do know: 52 people (said in the prose to be split 26 into the “no-AI” control group and 26 into the “AI access” treatment group; table 1 doesn’t quite add up that way) were given a 10-minute coding challenge, then a 35-minute task making use of a particular Python library, either with or without AI assistance. Finally, they had 25 minutes to answer questions related to the task: an appendix lists examples of the kinds of questions used, but not the actual question script. This is another factor that negatively impacts replicability.

What did they find?

There’s no significant difference in task completion time between the two groups (people who used AI, and people who didn’t use AI). That is, while the mean task completion time is slightly lower for the AI group, the spread of completion times is such that this doesn’t indicate a meaningful effect.
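To make that concrete, here is a minimal sketch of the reasoning, using made-up numbers rather than the paper’s data (which isn’t published): a Welch’s t-test on two groups whose means differ by a couple of minutes, but whose spreads are wide, doesn’t come close to significance.

```python
# Illustrative only: the means, standard deviations, and group sizes below are
# invented to mirror the shape of the result; they are not the paper's numbers.
from scipy import stats

ai_mean, ai_sd, ai_n = 28.0, 8.0, 26           # hypothetical AI-group times (minutes)
no_ai_mean, no_ai_sd, no_ai_n = 30.0, 8.0, 26  # hypothetical control-group times

t_stat, p_value = stats.ttest_ind_from_stats(
    ai_mean, ai_sd, ai_n,
    no_ai_mean, no_ai_sd, no_ai_n,
    equal_var=False,  # Welch's t-test: don't assume equal variances
)
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
# A 2-minute difference in means against an 8-minute spread gives p ≈ 0.37,
# nowhere near the conventional 0.05 threshold: no meaningful effect shown.
```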

Overall, people who used AI did less well on the test that they took immediately after completing the task, and this is a meaningful effect.
However, looking at their more detailed results (figure 7), it seems that among developers with less Python experience (1-3 years), the task completion time was vastly improved by AI access, and the outcome on the quiz was not significantly different.

Remember the sentence in the abstract was “Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library.” A different observation is “Participants with only a few years of Python experience showed significant productivity improvements, at no cost to learning the library”.

But does the quiz provide evidence of “having learned the library”? It’s a near-immediate test of recall of knowledge about a task that the participants had just completed. What would the researchers have found if they had waited four hours, or one day, or one week, to give the test? What would they have found if they had set “learning the library” as an explicit task, and given people time (with or without access to AI) to study? Would it make a difference if that study time came before the coding task, or after? The authors find that some participants spent significant time asking the AI assistant questions about the problem. So what they measure is the total time taken both to learn about and to solve the coding problem, in a situation where you’ve been given 35 minutes for the two combined.

The authors performed some qualitative analysis of their data. They find that people in the no-AI condition encounter more errors (syntax or API misuse) than people who use AI: this should be an interesting result, and a challenge to anyone who prejudges AI-generated code as “slop”. In this situation, it would seem that human-generated code is sloppier (at least in the first few minutes after it’s created).

They identify six AI interaction patterns among their treatment group, and find that people who used three of those patterns achieved better results on the quiz than the control group (though we can’t comment on significance, as we no longer have the statistics), without an impact on “productivity”. As someone who has hitched their wagon to the horse of intentional use of AI to improve software engineering skills, I ought to get the warm fuzzies from this. But given the questions about the study’s validity raised above, I don’t know that they’ve necessarily demonstrated such an improvement.

At this point I have another confounding factor to add to these results: the researchers questioned participants on their programming experience, but not on their experience of using AI assistants (beyond recruiting people with non-zero experience). Does the adoption of these patterns correlate with more experience of using AI? Do people with more AI experience get more productivity out of it? We can’t tell.

And we also can’t say anything about the skill of using an AI assistant itself. Participants are asked about their understanding of the Python library, and the authors treat their performance in answering those questions as a measure of “skill acquisition” in using the library. Is that the skill they exercised? Do the quiz answers tell us anything about that skill? If participants were asked a week later to complete a related task, would their performance correlate with the quiz results? Is using the Python library even a useful skill to have?

The authors observed that one pattern performed far worse on both task-completion time and quiz responses than all the others, and this was “Iterative AI Debugging”: verifying or debugging code using the AI. This result isn’t surprising, because the pattern represents using the model to evaluate the logic embodied in the code, and language models don’t evaluate logic. They’re best suited to what used to be called “caveman debugging” where you use print statements to turn dynamic behaviour into a sequence of text messages—because the foodstuff of the model is sequences of text. They don’t evaluate the internal state and control flow of software, so asking them to make changes based on understanding that internal state or control flow is unlikely to succeed. However, given the small amount of data on this debugging pattern, this is really a plausible conjecture worthy of follow up, not a finding.
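For the avoidance of doubt, here is what I mean by caveman debugging: a toy sketch, with a function invented for illustration (it has nothing to do with the study’s library), where print statements narrate the program’s state as text.

```python
# A toy illustration of "caveman debugging": print statements turn runtime
# state into a sequence of text. (The function is invented for illustration;
# it has nothing to do with the study or its Python library.)
def normalise(scores):
    top = max(scores)
    print(f"top={top}, scores={scores}")  # narrate the inputs and state as text
    result = [s / top for s in scores]
    print(f"result={result}")             # ...and the intermediate output
    return result

normalise([4, 8, 16])
# The trace this produces is plain text, the one form of evidence a language
# model can actually consume; it cannot step through the live process the way
# a debugger (or a person driving one) can.
```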

This preprint claims that using AI assistants to perform a task harms task-specific skill acquisition without significantly improving completion time. What it shows is that using AI assistants to perform a task leads to a broad distribution of ability to immediately answer questions related to the completion of the task, with an overall slight negative effect, without significantly improving completion time. The relation of the acquired knowledge to knowledge retention, or to skill, remains unexplored.
