LLMs are Pretty Bad “Teammates”

I’ve gone back to working on Kitpicks over winter break.

The app is quasi-functional: users can actually add RSS feeds, and the app will fetch them. But with the introduction of more complex logic, the app is starting to pick up a slight aroma of “legacy code.” I found myself playing whack-a-mole with regressions and having to trace spaghetti strands when debugging issues.

So, in the past few days, I’ve started getting a feel for real software engineering with AI, not just vibe-coding a user interface.

In these tasks, I’ve found Copilot to be sometimes magical and sometimes insanely frustrating. By far the biggest frustration is Copilot’s tendency to get tests passing by weakening the assertions.
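To make the weakened-assertion failure mode concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `feed_titles`, `buggy_feed_titles`, the sample entries, and both checks are invented names, not code from Kitpicks or output from Copilot.

```python
# Hypothetical example: none of these names come from the Kitpicks codebase.

def feed_titles(entries):
    """Return the stripped, non-empty titles from a list of feed entries."""
    return [e["title"].strip() for e in entries if e.get("title", "").strip()]


def buggy_feed_titles(entries):
    """A regressed version: no stripping and no filtering of empty titles."""
    return [e.get("title", "") for e in entries]


ENTRIES = [{"title": " First post "}, {"title": ""}, {"title": "Second post"}]


def strict_check(titles):
    # The original assertion: pins down exact content and order.
    return titles == ["First post", "Second post"]


def weakened_check(titles):
    # The "fixed" assertion: only checks the type, so almost anything passes.
    return isinstance(titles, list)


if __name__ == "__main__":
    print("strict, correct code:", strict_check(feed_titles(ENTRIES)))
    print("strict, buggy code:  ", strict_check(buggy_feed_titles(ENTRIES)))
    print("weakened, buggy code:", weakened_check(buggy_feed_titles(ENTRIES)))
```

The weakened check goes green on the regressed function, which is exactly why a passing test suite alone says little unless the assertions themselves are reviewed.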

In the past day or two, I feel like I’ve seen takes expressing nearly the full spectrum of attitudes towards AI-assisted engineering [1].

On the front page of Hacker News alone, at this moment, there’s a blog post arguing that AI is doomed because it can’t really “think”, and another that fully embraces AI and argues that traditional software engineering practices let you run it with almost no supervision. A couple of days ago I saw a post on HN arguing that slop and lack of rigor are inevitable, and that we must “romanticize” this. Today, my Google Discover feed recommended a post summarizing an interview with Martin Fowler, which takes the middle ground and argues that AI is highly useful for both analyzing and generating code but can’t be let loose unsupervised.

I’m not sure I have much to add to all this discourse, other than to throw a +1 in the direction of Mr. Fowler based on my experiences so far. AI does all kinds of crazy random stuff that absolutely degrades the codebase and causes regressions. Agent instructions help, but only a little bit. I’ve found that even the most strongly worded instructions are often ignored.

If LLMs are meant to replace engineers, they are replacing the most frustrating and flaky engineers you’ve ever worked with, who produce enough good code to somehow stay employed despite seeming to have zero common sense or teamwork skills.

But still, for most tasks, I think it saves time to give Copilot one or two attempts to solve a problem before stepping in and doing it manually. I have a nearly fully-featured app up and running:

[Image: Kitpicks Design 12-30]

and honestly the vast, vast majority of its code was generated by Copilot (incrementally, under my close supervision).

  1. Minus the fully skeptical attitudes that say AI is completely useless. I think that attitude is increasingly marginalized, and probably less and less credible, given how easy it is to get AI to write code that, at least, does something interesting.