A/B test yourself vs code agent

Why

There is no denying that LLMs, as a code-generation technology, are amazing. Yet with overuse there are significant trade-offs to consider.

First, there is a use-it-or-lose-it phenomenon with almost every human skill. The more you lean on LLMs to write your code, the faster your own programming skills deteriorate.

Second, there is an ongoing debate about the quality of generated code. Some people swear it's better than what programmers write; others think it produces junior-level code.

Especially if you are in the second camp and decide to limit your use of LLMs, the technology keeps progressing, so you might still want to keep an eye on how each new model performs.

Luckily, there is an easy way to check whether the generated code is up to your standards: run a simple A/B test on a given feature.

The setup

It’s really easy:

  1. Work on one feature 100% without LLMs: Just you and the feature. Get deep and intimate with the code base. Log the time it took you to make it work (a small logging sketch follows this list).

  2. Commit the code, then reset the branch: Store the code for later, but go back to a clean state.

  3. Let the AI do the same work: Obviously, don't lead the AI too much. Try to remember how naive you were before you coded the feature. It might even be better to write the prompt as a step 0, and only send it to the AI in this step.

  4. Compare the code: Deep down, decide whether you are happy with what the AI gave you. Was it significantly easier? Is the produced code cleaner and more maintainable in the long run?
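
If you want the comparison to be more than a gut feeling, it helps to keep a small log of each attempt. Below is a minimal sketch in Python, assuming a CSV log file and made-up feature and branch names; the setup above doesn't prescribe any of this, and any note-taking scheme works just as well.

    import csv
    from datetime import datetime
    from pathlib import Path

    LOG_FILE = Path("ab_log.csv")  # hypothetical log location, pick your own

    def log_run(feature: str, variant: str, minutes: float, notes: str = "") -> None:
        """Append one timed attempt ("human" or "llm") at a feature to the CSV log."""
        is_new = not LOG_FILE.exists()
        with LOG_FILE.open("a", newline="") as f:
            writer = csv.writer(f)
            if is_new:
                writer.writerow(["timestamp", "feature", "variant", "minutes", "notes"])
            writer.writerow([
                datetime.now().isoformat(timespec="seconds"),
                feature,
                variant,
                f"{minutes:.1f}",
                notes,
            ])

    # After each attempt, record how long it took and which branch holds the code.
    log_run("retry-logic-bugfix", "human", 95, "branch: retry-fix-human")
    log_run("retry-logic-bugfix", "llm", 40, "branch: retry-fix-llm")

Parking each attempt on its own branch also makes step 4 concrete: a plain git diff between the two branches shows exactly where you and the agent disagree.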

In my experience, bug fixes will be the most telling. But test broadly and experiment.

My experience

I don't want to taint this blog post with my opinions too much. Always think for yourself, schmuck. And run your own experiments.